1. Confirm where the reads are on the gene (plot, from most 3' UTR to least 3' UTR). Can we support this type of study with scRNA-seq: https://www.ncbi.nlm.nih.gov/pubmed/30290838 ?
2. Look at the RNA and find deviations from the reference.
3. Look at the DNA and filter out all somatic/germline mutations from the RNA editing candidates.
4. Look at germline/somatic mutations. Can we confirm patient identity with the germline mutations?
5. What's the ratio of germline to somatic mutations? Can we see somatic mutations in the RNA?
6. Can we see any spatial differences in cancers with various subclones and 3' UTR mRNA editing?

Interesting reading: https://www.ncbi.nlm.nih.gov/pubmed/24289319

In [123]:
import pysam
import collections
import time

bam_filename = "5k_pbmc_protein_v3_calmd.bam"
result_filename = "5k_pbmc_protein_v3_calmd_gene_long_intron_removed.bam"

limit = -1  # stop after this many reads; -1 never matches, so the whole file is processed

# Filter out all reads which don't map to a gene

In [21]:
def read_ref_seq(aligned_pairs, read_seq):
    """Pair each read base with its reference base, in alignment order."""
    result = []
    for read_coord, ref_coord, reference_base in aligned_pairs:
        # read_coord is None at deletions and reference skips.
        read_base = None if read_coord is None else read_seq[read_coord]
        result.append((read_base, reference_base))
    return result

In [28]:
flag_counter = collections.Counter()
total_reads = 0
after_duplicates = 0
after_genes_only = 0
good_reads = 0
genes = set()

with pysam.AlignmentFile(bam_filename, "rb") as f:
    with pysam.AlignmentFile(result_filename, "wb", template=f) as g:
        for i, read in enumerate(f):
            if i == limit:
                break
            
            total_reads += 1
            
            flag_counter[read.flag] += 1
            
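            # 0x900 = secondary (0x100) | supplementary (0x800) alignments.
            # Duplicate-marked reads (0x400) are NOT removed by this mask.
            if read.flag & 0x900 != 0:
                continue

            # Despite its name, this counts reads left after dropping
            # secondary/supplementary alignments.
            after_duplicates += 1

            # Keep only reads that carry a GN (gene name) tag.
            try:
                genes.add(read.get_tag("GN"))
            except KeyError:
                continue

            after_genes_only += 1

            # Drop reads spliced over an intron longer than 10 kb
            # (CIGAR operation 3 = N, reference skip).
            if any(code == 3 and length > 10000
                   for code, length in read.cigartuples):
                continue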
            
            good_reads += 1
            g.write(read)

In [29]:
print("Reads flags:", flag_counter)
print("Total reads:", total_reads)
print("Reads after secondary alignment removal:", after_duplicates)
print("Reads after discarding whatever didn't map to a gene:", after_genes_only)
print("Reads after discarding all reads with long introns:", good_reads)

Reads flags: Counter({0: 59179732, 256: 42396755, 16: 42268302, 4: 33264525, 1024: 25415291, 1040: 16208331, 1028: 14491834, 272: 12184627})
Total reads: 245409397
Reads after secondary alignment removal: 190828015
Reads after discarding whatever didn't map to a gene: 83212367
Reads after discarding all reads with long introns: 82786316
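
The keys of the flag counter are raw SAM bitfields. A minimal sketch for decoding them, using the bit values from the SAM specification; the helper name `decode_flag` is ours and is not part of pysam:

```python
# Bit values as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM flag (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


# The flag values seen in the counter above.
for flag in (0, 16, 256, 4, 1024, 1040, 1028, 272):
    print(flag, decode_flag(flag))
```

Under this decoding, 256 and 272 are the secondary alignments removed by the `0x900` mask, while 1024, 1040 and 1028 are duplicate-marked reads, which the flag filter keeps.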



In [38]:
print("Fraction of reads written to filtered file", good_reads/total_reads)

Fraction of reads written to filtered file 0.33733963333115563



In [30]:
print("Number of genes", len(genes))

Number of genes 22766



In [32]:
!samtools index 5k_pbmc_protein_v3_calmd_gene_long_intron_removed.bam

# Find RNA editing events, SNPs and SNVs

In [81]:
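def read_base_reference(read):
    """Map read coordinate -> (read base, reference base) for one aligned read."""
    # with_seq=True requires the MD tag (added here by samtools calmd).
    aligned_pairs = read.get_aligned_pairs(with_seq=True)
    read_seq = read.query_sequence
    result = {}
    for read_coord, ref_coord, reference_base in aligned_pairs:
        # read_coord is None at deletions and reference skips.
        read_base = None if read_coord is None else read_seq[read_coord]
        result[read_coord] = (read_base, reference_base)
    return result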
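
With `with_seq=True`, pysam reports the reference base in lower case wherever the read mismatches the reference, which is why a plain `!=` comparison is enough to flag mismatches. A toy illustration on hand-written aligned pairs (the data are made up, not taken from the BAM file, and `pair_bases` is a hypothetical stand-in for `read_base_reference` that does not need a pysam read):

```python
def pair_bases(aligned_pairs, read_seq):
    """Map read coordinate -> (read base, reference base)."""
    result = {}
    for read_coord, ref_coord, reference_base in aligned_pairs:
        read_base = None if read_coord is None else read_seq[read_coord]
        result[read_coord] = (read_base, reference_base)
    return result


# Made-up alignment: match, mismatch (lower-case reference), deletion, match.
toy_pairs = [(0, 100, "A"), (1, 101, "a"), (None, 102, "C"), (2, 103, "T")]
toy_bases = pair_bases(toy_pairs, "AGT")

# Keep only real read positions whose base differs from the reference.
mismatches = {pos: bases for pos, bases in toy_bases.items()
              if pos is not None and bases[0] != bases[1]}
print(mismatches)
```

Note that all deleted or skipped reference positions share the key `None`, so only the last one survives in the dictionary; this is harmless as long as lookups use a real query position.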

In [135]:
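start = time.time()
samfile = pysam.AlignmentFile(result_filename, "rb")
for pileupcolumn in samfile.pileup("1", 1000000, 1900000):
    disagreement = collections.Counter()
    total = 0
    for pileupread in pileupcolumn.pileups:
        # Skip reads with a deletion or a spliced-out intron at this column.
        if not pileupread.is_del and not pileupread.is_refskip:
            total += 1
            read = pileupread.alignment
            # NOTE: this recomputes the aligned pairs of the whole read for
            # every column the read covers, which is likely the slow part.
            read_base, reference_base = read_base_reference(read)[pileupread.query_position]
            # A mismatch shows up as a lower-case reference base.
            if read_base != reference_base:
                disagreement[(reference_base, read_base, read.is_reverse)] += 1
    # Report columns with more than 5 reads where the most common mismatch
    # occurs more than 3 times and in at least 10% of the reads.
    if disagreement and total > 5:
        max_disagree = max(disagreement.values())
        if max_disagree > 3 and max_disagree >= 0.1 * total:
            # pileupcolumn.pos is a 0-based reference position.
            print(pileupcolumn.reference_name, pileupcolumn.pos, disagreement, total)

stop = time.time()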

1 1000078 Counter({('a', 'G', True): 9}) 9
1 1013540 Counter({('t', 'C', False): 12}) 12
1 1014273 Counter({('a', 'G', False): 2267}) 2267
1 1014536 Counter({('g', 'A', False): 15, ('g', 'T', False): 1}) 89
1 1081960 Counter({('g', 'T', True): 33}) 33
1 1223250 Counter({('a', 'G', True): 34}) 34
1 1291863 Counter({('g', 'A', False): 8}) 16
1 1292516 Counter({('a', 'G', True): 22}) 48
1 1301655 Counter({('t', 'C', True): 21}) 47
1 1352964 Counter({('a', 'G', True): 7}) 7
1 1395345 Counter({('a', 'G', True): 6}) 6
1 1401953 Counter({('g', 'T', True): 204}) 206
1 1407231 Counter({('g', 'C', True): 74}) 75
1 1468855 Counter({('c', 'T', False): 8}) 9
1 1496121 Counter({('g', 'A', False): 60}) 61
1 1497604 Counter({('g', 'C', False): 5}) 6
1 1497605 Counter({('c', 'G', False): 5}) 6
1 1613973 Counter({('g', 'A', True): 30}) 54
1 1616546 Counter({('t', 'C', False): 4}) 7
1 1617053 Counter({('c', 'A', False): 11}) 77
1 1617055 Counter({('c', 'T', False): 10}) 77
1 1617056 Counter({('t', 'A', F
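
The Counter keys above are in reference-strand orientation. To judge whether a site fits the A-to-I editing pattern (read as A>G on the transcript), the substitution can be converted to the transcript strand. A minimal sketch, assuming the reads are sense-stranded (typical for 10x 3' libraries, but not verified here), so that a reverse-mapped read comes from a minus-strand transcript; `transcript_substitution` is a hypothetical helper:

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def transcript_substitution(reference_base, read_base, is_reverse):
    """Express a (reference, read, is_reverse) key as REF>ALT on the transcript.

    Assumption: reads are sense to the transcript, so reverse-mapped reads
    are complemented to get the transcript-strand bases.
    """
    ref = reference_base.upper()
    alt = read_base.upper()
    if is_reverse:
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


# Keys taken from the output above.
for key in [("a", "G", True), ("t", "C", True), ("a", "G", False), ("t", "C", False)]:
    print(key, transcript_substitution(*key))
```

Under that assumption, only `('a', 'G', False)` and `('t', 'C', True)` correspond to A>G on the transcript; the other keys do not fit the A-to-I pattern and are better explained as SNPs, SNVs or artifacts.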

In [128]:
pileupcolumn.reference_name
#dir(pileupcolumn)

'1'

In [129]:
dir(pileupcolumn)

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'get_mapping_qualities',
 'get_num_aligned',
 'get_query_names',
 'get_query_positions',
 'get_query_qualities',
 'get_query_sequences',
 'n',
 'nsegments',
 'pileups',
 'pos',
 'reference_id',
 'reference_name',
 'reference_pos',
 'set_min_base_quality',
 'tid']

In [118]:
print(stop - start)

28.734182119369507


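How many hours will it take to run the RNA modification/editing script on the whole file? A rough linear extrapolation from the timing above. Note that it divides the total read count by the length of the timed window in bases, so it is only an order-of-magnitude guess.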

In [121]:
(82786316/(1300000 - 1000000) * 28.734182119369507) / 3600

2.2025898897552536
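
The estimate above is a linear extrapolation of one timed run. A small sketch that makes the scaling explicit; `extrapolate_hours` is a hypothetical helper, and the inputs are the numbers from the cell above:

```python
def extrapolate_hours(elapsed_seconds, work_done, work_total):
    """Linearly scale a measured runtime to the full workload, in hours."""
    return elapsed_seconds * (work_total / work_done) / 3600


# Same inputs as the notebook cell: timed window length and total read count.
hours = extrapolate_hours(28.734182119369507, 1300000 - 1000000, 82786316)
print(round(hours, 2))
```

For a tighter estimate, `work_done` and `work_total` should be in the same unit, for example reads in the timed window versus reads in the whole file.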

In [134]:
read.get_aligned_pairs(with_seq=True)

[(0, 1299996, 'G'),
 (1, 1299997, 'G'),
 (2, 1299998, 'G'),
 (3, 1299999, 'G'),
 (4, 1300000, 'T'),
 (5, 1300001, 'C'),
 (6, 1300002, 'C'),
 (7, 1300003, 'A'),
 (8, 1300004, 'G'),
 (9, 1300005, 'C'),
 (10, 1300006, 'T'),
 (11, 1300007, 'G'),
 (12, 1300008, 'G'),
 (13, 1300009, 'T'),
 (14, 1300010, 'G'),
 (15, 1300011, 'C'),
 (16, 1300012, 'A'),
 (17, 1300013, 'G'),
 (18, 1300014, 'G'),
 (19, 1300015, 'A'),
 (20, 1300016, 'G'),
 (21, 1300017, 'G'),
 (22, 1300018, 'C'),
 (23, 1300019, 'T'),
 (24, 1300020, 'G'),
 (25, 1300021, 'T'),
 (26, 1300022, 'A'),
 (27, 1300023, 'G'),
 (28, 1300024, 'C'),
 (29, 1300025, 'C'),
 (30, 1300026, 'C'),
 (31, 1300027, 'T'),
 (32, 1300028, 'G'),
 (33, 1300029, 'C'),
 (34, 1300030, 'T'),
 (35, 1300031, 'G'),
 (36, 1300032, 'G'),
 (37, 1300033, 'A'),
 (38, 1300034, 'A'),
 (39, 1300035, 'G'),
 (40, 1300036, 'A'),
 (41, 1300037, 'A'),
 (42, 1300038, 'G'),
 (43, 1300039, 'C'),
 (44, 1300040, 'T'),
 (45, 1300041, 'G'),
 (46, 1300042, 'G'),
 (47, 1300043, 'A'),
 (