# Documentation - piRNA pipe




## 1 Purpose  

A pipeline for small RNA sequencing analysis (total, piwi-IP seq, etc.).




## 2 Workflow  


### 2.1 pre-analysis  

+ Main steps 

  - remove structural RNAs (rRNA, tRNA, snRNA, snoRNA, ..., check rRNA%) 
  - remove miRNAs (miRNA hairpin, 0 mismatch, unique + multi)  
  - split by length ([23,29], others)  
  - map to TE consensus (0 mismatch, unique) 
  - map to piRNA clusters (0 mismatch, unique)  
  - map to genome (0 mismatch, unique + multi)  
  
+ Extra  

  - read size distribution  
  - evaluate 1U-10A   
  - check overlap between samples  
  
+ To-do  
  
  - extract siRNAs (21nt)  
  - ping-pong signature score 
  - correlation between samples
  - compare with piRBase (v2.0), other published data
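
The length split and the read size distribution listed above can be sketched in plain Python. This is only an illustration, not the code in `piRNA_pipe.py` or `fragsize.py`: the function names `split_by_length` and `size_distribution` are hypothetical, and reads are assumed to be already loaded as `(name, sequence)` tuples.

```python
from collections import Counter


def split_by_length(reads, min_len=23, max_len=29):
    """Split (name, seq) reads into those inside [min_len, max_len] and the rest."""
    selected, excluded = [], []
    for name, seq in reads:
        if min_len <= len(seq) <= max_len:
            selected.append((name, seq))
        else:
            excluded.append((name, seq))
    return selected, excluded


def size_distribution(reads):
    """Count reads per length, returned as a dict sorted by length."""
    counts = Counter(len(seq) for _, seq in reads)
    return dict(sorted(counts.items()))


# Toy input (hypothetical reads, not real data)
reads = [
    ("r1", "T" * 21),  # siRNA-sized read
    ("r2", "T" * 23),  # lower bound of the piRNA window
    ("r3", "T" * 26),
    ("r4", "T" * 29),  # upper bound of the piRNA window
    ("r5", "T" * 30),
]
selected, excluded = split_by_length(reads)
print([name for name, _ in selected])
print(size_distribution(reads))
```

Both bounds are inclusive, matching the `[23,29]` notation used in the step list.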
 



## 3 Analysis  


### 3.1 main  

  - `piRNA_pipe.py`  
  
  
  
### 3.2 modules  

  - `fragsize.py` - calculate the distribution of read length  
  
  - `u1a10.py` - calculate the U1A10 content  
  
  - `fx_overlap.py` - check the overlap between two fastx files
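
As an illustration of what a U1A10 calculation involves, the sketch below classifies reads by whether they start with U (position 1) and carry an A at position 10. It is not the implementation in `u1a10.py`: the function name `u1a10_fractions` is hypothetical, and the class labels are assumptions modelled on the `1U_not_10A` label used in the change log.

```python
def u1a10_fractions(seqs):
    """Return the fraction of reads in each 1U/10A class.

    Reads are usually stored in the DNA alphabet, so a 5' uridine shows up
    as 'T'; both 'T' and 'U' are accepted here.
    Reads shorter than 10 nt cannot carry a 10A and count as 10A-negative.
    """
    counts = {
        "1U_10A": 0,
        "1U_not_10A": 0,
        "not_1U_10A": 0,
        "not_1U_not_10A": 0,
    }
    for seq in seqs:
        s = seq.upper()
        is_1u = s[:1] in ("T", "U")
        is_10a = len(s) >= 10 and s[9] == "A"
        key = ("1U" if is_1u else "not_1U") + "_" + ("10A" if is_10a else "not_10A")
        counts[key] += 1
    total = sum(counts.values())
    return {k: (v / total if total else 0.0) for k, v in counts.items()}


# Toy input (hypothetical sequences): one read per class
seqs = [
    "TGCATGCATAGC",  # 1U, 10A
    "TGCATGCATCGC",  # 1U, not 10A
    "AGCATGCATAGC",  # not 1U, 10A
    "GGCATGCATCGC",  # not 1U, not 10A
]
fractions = u1a10_fractions(seqs)
print(fractions)
```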





### 3.3 Directory structure 

```
├── 00.raw_data
├── 01.clean_data
├── 02.collapse
├── 03.overlap
├── 04.smRNA
├── 05.miRNA
├── 06.size_exclude
├── 06.size_select
├── 07.te
├── 08.piRNA_cluster
├── 09.genome
├── 10.unmap
├── 11.stat
└── 12.report

```
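
For reference, a layout like the one above could be created up front with `pathlib`. This is a minimal sketch under assumptions: the helper `init_outdir` is hypothetical, and the pipeline itself may create these directories lazily or in a different way.

```python
import tempfile
from pathlib import Path

# Sub-directory names copied from the layout above
SUBDIRS = [
    "00.raw_data", "01.clean_data", "02.collapse", "03.overlap",
    "04.smRNA", "05.miRNA", "06.size_exclude", "06.size_select",
    "07.te", "08.piRNA_cluster", "09.genome", "10.unmap",
    "11.stat", "12.report",
]


def init_outdir(outdir):
    """Create every sub-directory under outdir and return a name -> Path map."""
    root = Path(outdir)
    paths = {name: root / name for name in SUBDIRS}
    for p in paths.values():
        p.mkdir(parents=True, exist_ok=True)
    return paths


# Demonstrate in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    paths = init_outdir(tmp)
    created = sorted(p.name for p in Path(tmp).iterdir())
    print(created)
```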


### 3.4 Output files



 
## x. Change log




```
piRNA analysis (small RNAseq) 


date: 2020-12-27

1. support collapsed reads, counting, RNA seqs, reads


date: 2020-12-23
Author: Ming Wang

# flowchart of piRNA analysis
1. remove structural RNAs (unique + multi)
2. remove miRNAs (unique + multi) 
3. remove reads not in [23-29] nt 
4. collapse reads: compare nt 1-23 only, allow 1-2 nt differences at the 3' end (2019-11-26)
5. split into 4 groups: (1U+/-,10A+/-;)
6. map to TE consensus, (unique + multi), only 1U_not_10A
7. map to genome (unique + multi) 
Functions:
rename fastq reads: piR0000001-0000001
piR (piRNA), (piRNA number) {reads number}
mapping, unique + multiple/unique


version: 2020-07-28
update:
1. remove temp fastq files
2. gzip fastq files

version: 2020-07-25
update:
1. collapse reads 


date: 2020-07-23
in brief:

1. remove 3' adapters  
2. remove structural RNAs (tRNAs, rRNAs) (unique + multiple)     
3. filter by length, 23-29 nt   
4. collapse reads, only compare 1-23 nt (*)  
   collapse reads, (regular)  
   trim reads to 23nt from 3' end  

5. map to TE consensus (unique, multiple, unique + multiple)   
6. map to piRNA clusters (unique, multiple, unique + multiple)   
7. map to genome (not-TE, no-piRNAcluster)    
8. Overall, map to genome
```
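
The collapse and rename steps mentioned in the change log can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: `collapse_reads` is a hypothetical name, ranking by abundance is an assumption, and using the shared 23-nt prefix as the representative sequence is a simplification. Only the `piR0000001-0000001` name pattern (piRNA number, then read count) comes from the notes above.

```python
from collections import Counter


def collapse_reads(seqs, prefix="piR", compare_len=None):
    """Collapse reads and name them '<prefix><rank>-<count>'.

    compare_len=None collapses identical sequences only.
    compare_len=23 groups reads that share the same first 23 nt and keeps
    that shared prefix as the representative sequence.
    """
    keys = [s if compare_len is None else s[:compare_len] for s in seqs]
    counts = Counter(keys)
    # Rank by abundance; ties are broken alphabetically for stable names
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (f"{prefix}{rank:07d}-{n:07d}", seq)
        for rank, (seq, n) in enumerate(ranked, start=1)
    ]


# Toy input (hypothetical sequences): three reads share a 23-nt prefix
base = "TGCATGCATAGCTAGCTAGCTAG"  # 23 nt
seqs = [base + "AA", base + "AC", base + "AA", "A" * 25]

regular = collapse_reads(seqs)
by_prefix = collapse_reads(seqs, compare_len=23)
print([name for name, _ in regular])
print([name for name, _ in by_prefix])
```

Comparing only the first 23 nt merges reads that differ at their 3' ends, which is why the prefix mode yields fewer, larger groups than regular collapsing.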





In [133]:
%reload_ext autoreload
%autoreload 2
from piRNA_pipe import *

args = {
    'fq': 'test/a.fq.gz',
    'outdir': 'test/results',
    'collapse': True,
    'subject': 'test/b.fq.gz',
    'force_overlap': True,
    'ov_type': 1,
}
# p = pipe(**args).run()

args = {
    'query': 'test/a.fq',
    'subject': 'test/b.fq',
    'outdir': 'test/overlap',
    'mm': 2
}
OverlapFq(**args).run()
# overlap_fq('test/a.fq', 'test/b.fq', 'test/overlap2')

[2021-03-03 23:39:38 INFO] run_shell_cmd: PID=3756605, PGID=3756605, CMD=bowtie-build -q --threads 4 test/overlap/index/b.fa test/overlap/index/b
[2021-03-03 23:39:39 INFO] run_shell_cmd: PID=3756669, PGID=3756669, CMD=bowtie -S -k 1 -v 2 -p 4 -l 23 -n 3 --no-unal -x test/overlap/index/b test/a.fq 2>test/overlap/a.align.log | samtools view -bhS - | samtools sort -o test/overlap/a.align.bam - && samtools index test/overlap/a.align.bam

'test/overlap/a.fq.gz'

In [126]:
import pyfastx
import pysam

c = 'test/overlap2/a.fq'

# overlap-seqkit: read names kept as a set for fast membership tests
db = {name for name, _, _ in pyfastx.Fastq(c, build_index=False)}

## sam
bam_in = 'test/overlap/a.align.bam'  # used as the header template
bam_in2 = 'test/overlap/a.align.filt.bam'
bam_out = 'test/overlap/ov.bam'

# write the filtered alignments whose read name is not in the seqkit overlap;
# the context manager closes the files so the output BAM is complete
i = 0
with pysam.AlignmentFile(bam_in) as b_in, \
        pysam.AlignmentFile(bam_in2) as b_in2, \
        pysam.AlignmentFile(bam_out, 'wb', template=b_in) as b_out:
    for r in b_in2:
        if r.query_name not in db:
            b_out.write(r)
            i += 1

i

591

In [99]:
da[1]

'ST-E00318:996:HFMLGCCX2:2:1101:16254:1766_GTGGATCTACTAT'

In [130]:
import re
import pysam

def check_align(x, self_mm=3, self_mm_range=(1, 30)):
    """Check the MD tag of a SAM record, e.g. MD:Z:20G0G6

    Return True when the read aligns from its first base to the first base
    of the reference, has at most `self_mm` mismatches, and the number of
    matched bases before the last mismatch lies in
    [self_mm_range[0], self_mm_range[1]).
    x AlignedSegment
    """
    if not isinstance(x, pysam.AlignedSegment):
        return False
    # the alignment must start at position 0 of both read and reference
    if x.query_alignment_start != 0 or x.reference_start != 0:
        return False
    if not x.has_tag('MD'):
        return False
    md = x.get_tag('MD')  # '28G0G1' (mismatches) or '29' (perfect match)
    mm = re.findall('[ACGTN]', md)  # mismatched reference bases
    p = re.split('[ACGTN]', md)  # lengths of the matched runs
    if len(mm) > self_mm:
        return False  # too many mismatches
    if len(p) > 1:
        px = sum(map(int, p[:-1]))  # matched bases before the last mismatch
        return px in range(self_mm_range[0], self_mm_range[1])
    return True  # no mismatches

sam = pysam.AlignmentFile('test/overlap/ov.bam')
r = next(sam)
# print(r.qname, r.reference_name)
check_align(r)

True