1. Confirm where the reads fall on the gene (plot, from most 3' UTR to least 3' UTR). Can we support this type of study with scRNA-seq: https://www.ncbi.nlm.nih.gov/pubmed/30290838 ?
2. Look at the RNA and find deviations from the reference.
3. Look at the DNA and filter all somatic/germline mutations out of the RNA editing candidates.
4. Look at germline/somatic mutations. Can we confirm patient identity from the germline mutations?
5. What's the ratio of germline to somatic mutations? Can we see somatic mutations in the RNA?
6. Can we see any spatial differences in cancers with various subclones and 3' UTR mRNA editing?
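
For question 4, one simple way to check patient identity is genotype concordance: compare the genotypes two samples carry at the sites both of them cover. The sketch below is only an illustration. The function name, the dict layout and the toy genotypes are all hypothetical and not part of this notebook's pipeline.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of shared sites at which two samples carry the same genotype.

    calls_a / calls_b map (chrom, pos) -> genotype string such as "A/G".
    Both the dict layout and the genotype encoding are assumptions.
    """
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        return None  # no overlapping sites, so concordance is undefined
    matches = sum(calls_a[site] == calls_b[site] for site in shared)
    return matches / len(shared)


# Toy data, invented for illustration only.
dna_calls = {("1", 100): "A/G", ("1", 200): "T/T", ("1", 300): "G/T"}
rna_calls = {("1", 100): "A/G", ("1", 200): "T/T", ("1", 400): "A/A"}

concordance = genotype_concordance(dna_calls, rna_calls)
print("Shared-site genotype concordance:", concordance)
```

With RNA data, allele-specific expression can make a heterozygous site look homozygous, so a real comparison would probably need a tolerance rather than exact matching.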

Interesting reading: https://www.ncbi.nlm.nih.gov/pubmed/24289319

In [123]:
import pysam
import collections
import time

bam_filename = "5k_pbmc_protein_v3_calmd.bam"
result_filename = "5k_pbmc_protein_v3_calmd_gene_long_intron_removed.bam"

limit = -1

# Filter out all reads which don't map to a gene

In [21]:
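def read_ref_seq(aligned_pairs, read_seq):
    """Pair each aligned position's read base with its reference base.

    aligned_pairs is expected to come from
    read.get_aligned_pairs(with_seq=True).
    """
    result = []
    for read_coord, ref_coord, reference_base in aligned_pairs:
        try:
            read_base = read_seq[read_coord]
        except TypeError:
            # read_coord is None where the read has a deletion or a skipped region
            read_base = None
        result.append((read_base, reference_base))
    return result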

In [28]:
flag_counter = collections.Counter()
total_reads = 0
after_duplicates = 0
after_genes_only = 0
good_reads = 0
genes = set()

with pysam.AlignmentFile(bam_filename, "rb") as f:
    with pysam.AlignmentFile(result_filename, "wb", template=f) as g:
        for i, read in enumerate(f):
            if i == limit:
                break
            
            total_reads += 1
            
            flag_counter[read.flag] += 1
            
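            # 0x900 = secondary (0x100) | supplementary (0x800) alignments.
            # Note: reads flagged as duplicates (0x400) are NOT removed here,
            # despite the name of the counter below.
            if read.flag & 0x900 != 0:
                continue

            after_duplicates += 1

            # Keep only reads annotated with a gene name (GN tag).
            try:
                genes.add(read.get_tag("GN"))
            except KeyError:
                continue

            after_genes_only += 1

            # Drop reads with a skipped region (CIGAR op 3 = N) longer than 10 kb.
            # cigartuples is None for unmapped reads, hence the `or []`.
            if any(code == 3 and length > 10000
                   for code, length in read.cigartuples or []):
                continue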
            
            good_reads += 1
            g.write(read)
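
The long-intron test in the loop above can also be pulled out into a small helper that works on plain CIGAR tuples, which makes it easy to check without opening a BAM file. This is a minimal sketch. The helper names and the example CIGAR are made up for illustration, and the 10000 bp default simply mirrors the threshold used above.

```python
CIGAR_REF_SKIP = 3  # BAM CIGAR op code for "N" (skipped region, e.g. a splice gap)


def longest_ref_skip(cigartuples):
    """Length of the longest N operation, or 0 if there is none."""
    return max(
        (length for code, length in cigartuples or [] if code == CIGAR_REF_SKIP),
        default=0,
    )


def has_long_intron(cigartuples, max_intron=10000):
    """True if any skipped region is longer than max_intron bases."""
    return longest_ref_skip(cigartuples) > max_intron


# Hypothetical read: 50 matched bases, a 12 kb gap, then 41 matched bases.
spliced = [(0, 50), (3, 12000), (0, 41)]
print(has_long_intron(spliced))
print(has_long_intron([(0, 91)]))
```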

In [29]:
print("Reads flags:", flag_counter)
print("Total reads:", total_reads)
print("Reads after secondary alignment removal:", after_duplicates)
print("Reads after discarding whatever didn't map to a gene:", after_genes_only)
print("Reads after discarding all reads with long introns:", good_reads)

Reads flags: Counter({0: 59179732, 256: 42396755, 16: 42268302, 4: 33264525, 1024: 25415291, 1040: 16208331, 1028: 14491834, 272: 12184627})
Total reads: 245409397
Reads after secondary alignment removal: 190828015
Reads after discarding whatever didn't map to a gene: 83212367
Reads after discarding all reads with long introns: 82786316
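
To make the flag counts above easier to read, the sketch below decodes each flag value into the SAM flag bits it contains and recomputes the first filter step from the counter alone. Only the bits that occur in this output are listed, and the helper name `describe_flag` is made up here. The counts are copied from the output above.

```python
import collections

# Subset of SAM flag bits, as defined in the SAM specification.
FLAG_BITS = {
    0x4: "unmapped",
    0x10: "reverse strand",
    0x100: "secondary",
    0x400: "duplicate",
    0x800: "supplementary",
}


def describe_flag(flag):
    """Human-readable list of the known bits set in a SAM flag."""
    names = [name for bit, name in FLAG_BITS.items() if flag & bit]
    return ", ".join(names) if names else "mapped, forward strand"


flag_counter = collections.Counter({
    0: 59179732, 256: 42396755, 16: 42268302, 4: 33264525,
    1024: 25415291, 1040: 16208331, 1028: 14491834, 272: 12184627,
})

total = sum(flag_counter.values())
# Reads dropped by the `flag & 0x900` test (secondary or supplementary).
dropped = sum(n for flag, n in flag_counter.items() if flag & 0x900)
# Reads flagged as duplicates, which the filter above keeps.
duplicates = sum(n for flag, n in flag_counter.items() if flag & 0x400)

for flag, n in sorted(flag_counter.items()):
    print(f"{flag:>5}  {describe_flag(flag):<28} {n:>12,}")
print("Kept after 0x900 filter:", total - dropped)
print("Flagged as duplicate:", duplicates)
```

The reads with flags 256 and 272 account for the whole drop from 245409397 to 190828015, so this step removes secondary alignments only. Reads carrying the duplicate bit pass through it.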


In [38]:
print("Fraction of reads written to filtered file", good_reads/total_reads)

Fraction of reads written to filtered file 0.33733963333115563


In [30]:
print("Number of genes", len(genes))

Number of genes 22766


In [32]:
!samtools index 5k_pbmc_protein_v3_calmd_gene_long_intron_removed.bam

# Find RNA editing events, SNPs and SNVs

In [81]:
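def read_base_reference(read):
    """Map each read coordinate to (read_base, reference_base).

    Requires the MD tag (added by `samtools calmd`). pysam reports the
    reference base in lowercase wherever the read disagrees with it.
    """
    aligned_pairs = read.get_aligned_pairs(with_seq=True)
    read_seq = read.query_sequence  # `read.seq` is a deprecated alias
    result = {}
    for read_coord, ref_coord, reference_base in aligned_pairs:
        try:
            read_base = read_seq[read_coord]
        except TypeError:
            # read_coord is None at deletions and skipped regions
            read_base = None
        result[read_coord] = (read_base, reference_base)
    return result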
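
Because the pileup loop calls `read_base_reference` once per read per column, the same read is re-processed at every position it covers. One possible speed-up is to memoise the result per read. The wrapper below is a sketch of that idea, not something this notebook ran. `make_cached` is a made-up name, the cache key is an assumption about what identifies an alignment uniquely enough, and the stand-in read is a plain `SimpleNamespace` rather than a pysam object.

```python
from types import SimpleNamespace


def make_cached(compute, max_entries=50000):
    """Wrap compute(read) so that each alignment is processed only once."""
    cache = {}

    def cached(read):
        # Assumed key: name + start + flag identifies one alignment record.
        key = (read.query_name, read.reference_start, read.flag)
        if key not in cache:
            if len(cache) >= max_entries:
                cache.clear()  # crude eviction to keep memory bounded
            cache[key] = compute(read)
        return cache[key]

    return cached


# Demonstration with a stand-in for the expensive per-read function.
calls = []


def slow_lookup(read):
    calls.append(read.query_name)
    return {0: ("A", "A")}


cached_lookup = make_cached(slow_lookup)
fake_read = SimpleNamespace(query_name="read1", reference_start=100, flag=0)
for _ in range(3):
    cached_lookup(fake_read)
print("Underlying function called", len(calls), "time(s)")
```

In the notebook this would be used as `cached = make_cached(read_base_reference)` before the loop, with `cached(read)` replacing the direct call.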

In [117]:
start = time.time()
samfile = pysam.AlignmentFile(result_filename, "rb")
for pileupcolumn in samfile.pileup("1", 1000000, 1300000):
    disagreement = collections.Counter()
    total = 0
    for pileupread in pileupcolumn.pileups:
        if not pileupread.is_del and not pileupread.is_refskip:
            total += 1
            read = pileupread.alignment
            read_base, reference_base = read_base_reference(read)[pileupread.query_position]
            if read_base != reference_base:
                disagreement[(reference_base, read_base)] += 1
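    # Report a site when depth is above 5 and the most common mismatch is
    # seen in more than 3 reads and in at least 10% of the reads.
    if disagreement and total > 5:
        max_disagree = max(disagreement.values())
        if max_disagree > 3 and max_disagree >= 0.1 * total:
            # pileupcolumn.pos is a 0-based reference coordinate
            print(pileupcolumn.pos, disagreement, total)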
stop = time.time()

1000078 Counter({('a', 'G'): 9}) 9
1013540 Counter({('t', 'C'): 12}) 12
1014273 Counter({('a', 'G'): 2267}) 2267
1014536 Counter({('g', 'A'): 15, ('g', 'T'): 1}) 89
1081960 Counter({('g', 'T'): 33}) 33
1223250 Counter({('a', 'G'): 34}) 34
1291863 Counter({('g', 'A'): 8}) 16
1292516 Counter({('a', 'G'): 22}) 48
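
The reporting rule used in the loop, and a first-pass reading of the sites it printed, can be written as two small functions. This is a sketch. The function names are made up, and the A-to-I check relies only on the fact that inosine is read as G, so an edited A shows up as A>G (or T>C when the transcript lies on the reverse strand). It cannot tell editing apart from a SNP, which is why the DNA comparison in step 3 is still needed.

```python
import collections


def is_candidate(disagreement, total, min_depth=5, min_alt=3, min_frac=0.1):
    """Same thresholds as the pileup loop above."""
    if not disagreement or total <= min_depth:
        return False
    top = max(disagreement.values())
    return top > min_alt and top >= min_frac * total


def summarize(disagreement, total):
    """Most common substitution, its read fraction, and whether it looks A-to-I."""
    (ref_base, read_base), count = disagreement.most_common(1)[0]
    change = f"{ref_base.upper()}>{read_base.upper()}"
    return change, count / total, change in {"A>G", "T>C"}


# Three of the sites printed above: position -> (mismatch counter, depth).
sites = {
    1000078: (collections.Counter({("a", "G"): 9}), 9),
    1014536: (collections.Counter({("g", "A"): 15, ("g", "T"): 1}), 89),
    1291863: (collections.Counter({("g", "A"): 8}), 16),
}

for pos, (disagreement, total) in sites.items():
    change, fraction, a_to_i_like = summarize(disagreement, total)
    print(pos, change, round(fraction, 3), "A-to-I-like" if a_to_i_like else "other")
```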


In [118]:
print(stop - start)

28.734182119369507


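How many hours will it take to run the RNA modification/editing script? The estimate below scales the 300 kb timing by the total number of filtered reads, so it mixes a read count with a window length in bp and should be treated as a rough guess only.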

In [121]:
(82786316/(1300000 - 1000000) * 28.734182119369507) / 3600

2.2025898897552536
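
The extrapolation in the cell above divides a read count by a window length in bp. An alternative is to scale by the number of reference bases to scan. The sketch below reproduces the notebook's figure and then shows the genome-length version. `GENOME_BP` is an assumed round number for the human genome, not a value from this notebook, and the whole calculation assumes that runtime grows linearly with bases scanned, which coverage differences between regions will violate.

```python
def extrapolate_hours(window_seconds, window_size, total_size):
    """Linear extrapolation from one timed window to the whole data set."""
    return total_size / window_size * window_seconds / 3600


WINDOW_SECONDS = 28.734182119369507  # timing measured above
WINDOW_BP = 1300000 - 1000000        # size of the timed region on chromosome 1

# Reproduces the cell above (read count in the numerator).
notebook_estimate = extrapolate_hours(WINDOW_SECONDS, WINDOW_BP, 82786316)

# Assumption: roughly 3.1e9 bp of reference sequence to scan.
GENOME_BP = 3.1e9
genome_estimate = extrapolate_hours(WINDOW_SECONDS, WINDOW_BP, GENOME_BP)

print("Notebook estimate (hours):", notebook_estimate)
print("Genome-length estimate (hours):", genome_estimate)
```

Since `pileup` only visits covered positions, the true runtime depends on how much of the genome has reads and how deep they are, so timing a few more windows would give a better basis than either figure.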