# Documentation - piRNA pipe




## 1 Purpose  

For small RNA sequencing analysis (total, piwi-IP seq, etc.).




## 2 Workflow  


### 2.1 Pre-analysis  

+ Main steps 

  - remove structural RNAs (rRNA, tRNA, snRNA, snoRNA, ..., check rRNA%) 
  - remove miRNAs (miRNA hairpin, 0 mismatch, unique + multi)  
  - split by length ([23,29], others)  
  - map to TE consensus (0 mismatch, unique) 
  - map to piRNA clusters (0 mismatch, unique)  
  - map to genome (0 mismatch, unique + multi)  
  
+ Extra  

  - read size distribution  
  - evaluate 1U-10A   
  - check overlap between samples  
  
+ To-do  
  
  - extract siRNAs (21nt)  
  - ping-pong signature score 
  - correlation between samples
  - compare with piRBase (v2.0), other published data
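
The length split and the read size distribution listed above can be sketched in a few lines. This is a minimal illustration, not the code used by `piRNA_pipe.py`; the function names `split_by_length` and `length_distribution` are hypothetical, and the inclusive window [23, 29] is taken from the step list.

```python
from collections import Counter


def split_by_length(seqs, min_len=23, max_len=29):
    """Split sequences into (selected, excluded) using an inclusive length window.

    Hypothetical helper: the defaults mirror the [23,29] window in the step list.
    """
    selected, excluded = [], []
    for s in seqs:
        if min_len <= len(s) <= max_len:
            selected.append(s)
        else:
            excluded.append(s)
    return selected, excluded


def length_distribution(seqs):
    """Return {read_length: number_of_reads}, ordered by read length."""
    return dict(sorted(Counter(len(s) for s in seqs).items()))


# Toy input: lengths 21, 24, 30 and 26
reads = ["T" * 21, "TAGC" * 6, "A" * 30, "T" * 26]
selected, excluded = split_by_length(reads)
dist = length_distribution(reads)
```

Here `selected` would correspond to the reads kept for the TE, piRNA cluster and genome mapping steps, and `excluded` to the "others" group.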
 



## 3 Analysis  


### 3.1 Main  

  - `piRNA_pipe.py`  
  
  
  
### 3.2 Modules  

  - `fragsize.py` - calculate the distribution of read length  
  
  - `u1a10.py` - calculate the U1A10 content  
  
  - `fx_overlap.py` - check the overlap between two fastx files
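
To make the U1A10 content concrete, the sketch below classifies each read by whether position 1 is U and position 10 is A. It is an illustration under assumptions, not the implementation in `u1a10.py`: the function names are hypothetical, reads are assumed to be DNA-encoded (so U appears as `T`), and the group labels follow the `1U_not_10A` naming used in the change log.

```python
from collections import Counter


def u1a10_group(seq):
    """Return one of: 1U_10A, 1U_not_10A, not_1U_10A, not_1U_not_10A.

    Assumption: sequencing reads encode U as T, so both letters are accepted.
    Reads shorter than 10 nt cannot carry a 10A.
    """
    seq = seq.upper()
    is_1u = len(seq) >= 1 and seq[0] in "TU"
    is_10a = len(seq) >= 10 and seq[9] == "A"
    return ("1U" if is_1u else "not_1U") + "_" + ("10A" if is_10a else "not_10A")


def u1a10_content(seqs):
    """Count reads per 1U/10A group."""
    return Counter(u1a10_group(s) for s in seqs)


reads = [
    "TGCATGCATAGG",  # T at position 1, A at position 10
    "TGCATGCATGGG",  # T at position 1, G at position 10
    "GGCATGCATAGG",  # G at position 1, A at position 10
    "GGC",           # too short for position 10
]
content = u1a10_content(reads)
```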





### 3.3 Directory structure 

```
├── 00.raw_data
├── 01.clean_data
├── 02.collapse
├── 03.overlap
├── 04.smRNA
├── 05.miRNA
├── 06.size_exclude
├── 06.size_select
├── 07.te
├── 08.piRNA_cluster
├── 09.genome
├── 10.unmap
├── 11.stat
└── 12.report

```
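
As a rough sketch of how this layout could be prepared, the snippet below creates the step directories under `<outdir>/<smp_name>`, matching paths such as `test/aaaaaa/b/00.raw_data` used elsewhere in this document. The helper name `make_project_dirs` is hypothetical, and the pipeline itself may create these directories differently.

```python
import tempfile
from pathlib import Path

# Directory names copied from the tree above
SUBDIRS = [
    "00.raw_data", "01.clean_data", "02.collapse", "03.overlap",
    "04.smRNA", "05.miRNA", "06.size_exclude", "06.size_select",
    "07.te", "08.piRNA_cluster", "09.genome", "10.unmap",
    "11.stat", "12.report",
]


def make_project_dirs(outdir, smp_name):
    """Create <outdir>/<smp_name>/<step> for every step; return the sample root."""
    root = Path(outdir) / smp_name
    for name in SUBDIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


# Usage on a throwaway directory, so nothing is written to a real project
with tempfile.TemporaryDirectory() as tmp:
    root = make_project_dirs(tmp, "b")
    created = sorted(p.name for p in root.iterdir())
```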


### 3.4 Output files



 
## x. Change log




```
piRNA analysis (small RNAseq) 


date: 2020-12-27

1. support collapsed reads, counting, RNA seqs, reads


date: 2020-12-23
Author: Ming Wang

# flowchart of piRNA analysis
1. remove structural RNAs (unique + multi)
2. remove miRNAs (unique + multi) 
3. remove reads not in [23-29] nt 
4. collapse reads: consider 1-23nt only, allow 1-2 nt to differ at the 3' end (2019-11-26)
5. split into 4 groups: (1U+/-,10A+/-;)
6. map to TE consensus, (unique + multi), only 1U_not_10A
7. map to genome (unique + multi) 
Functions:
rename fastq reads: piR0000001-0000001
piR (piRNA), (piRNA number) {reads number}
mapping, unique + multiple/unique


version: 2020-07-28
update:
1. remove temp fastq files
2. gzip fastq files

version: 2020-07-25
update:
1. collapse reads 


date: 2020-07-23
in brief:

1. remove 3' adapters  
2. remove structural RNAs (tRNAs, rRNAs) (unique + multiple)     
3. filter by length, 23-29 nt   
4. collapse reads, only compare 1-23 nt (*)  
   collapse reads, (regular)  
   trim reads to 23nt from 3' end  

5. map to TE consensus (unique, multiple, unique + multiple)   
6. map to piRNA clusters (unique, multiple, unique + multiple)   
7. map to genome (not-TE, no-piRNAcluster)    
8. Overall, map to genome
```
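
The collapse and rename steps from the change log can be illustrated as follows. This is a sketch under assumptions rather than the pipeline's code: reads are grouped by their first 23 nt, groups are ranked by abundance (the ranking rule is an assumption), and names follow the `piR0000001-0000001` pattern, read here as piRNA number followed by read count. The function name `collapse_reads` is hypothetical.

```python
from collections import Counter


def collapse_reads(seqs, prefix_len=23):
    """Collapse reads sharing the same first `prefix_len` nt.

    Returns a list of (name, prefix, count) with names like piR0000001-0000003.
    Assumption: groups are numbered by decreasing count, ties broken by sequence.
    """
    counts = Counter(s[:prefix_len] for s in seqs)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [
        (f"piR{i:07d}-{n:07d}", prefix, n)
        for i, (prefix, n) in enumerate(ranked, start=1)
    ]


reads = [
    "A" * 23 + "AA",  # differs from the next two reads only after position 23
    "A" * 23 + "AC",
    "A" * 23,
    "C" * 25,
]
collapsed = collapse_reads(reads)
```

Comparing only positions 1-23 means reads that differ solely at their 3' ends fall into the same group, which is the effect described in the change log.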





In [None]:
%reload_ext autoreload
%autoreload 2
from piRNA_pipe import *

args = {
    'fq': 'test/a.fq.gz',
    'outdir': 'test/results',
    'collapse': True,
    'subject': 'test/b.fq.gz',
    'force_overlap': True,
    'ov_type': 1,
}
# p = pipe(**args).run()

args = {
    'query': 'test/a.fq',
    'subject': 'test/b.fq',
    'outdir': 'test/overlap',
    'mm': 2
}
OverlapFq(**args).run()
# overlap_fq('test/a.fq', 'test/b.fq', 'test/overlap2')

In [None]:
# U1A10

%reload_ext autoreload
%autoreload 2
from fastx import *

fx = ['test/a.fq.gz', 'test/b.fq.gz']

FxU1A10(fx).run()


In [None]:
# Alignment
%reload_ext autoreload
%autoreload 2
import pyfastx
from align import *

args = {
    'fx': 'test/a.fq.gz',
    'index': '/data/yulab/wangming/data/genome/dm6/bowtie_index/miRNA',
    'outdir': 'test/bbbbbb',
    'unique': 'multi'
}

Align(**args).run()


In [97]:
# PiRNApipe

%reload_ext autoreload
%autoreload 2
import pyfastx
from piRNA_pipe import *

args = {
    'fq': 'test/b.fq.gz',
    'outdir': 'test/aaaaaa',
    'genome': 'dm6',
    'subject_list': 'test/a.fq.gz'
}

p1 = PiRNApipe(**args)
p1.run()
p2 = overlap_subject(p1)


--------------------------------------------------------------------------------
             fq: /data/yulab/wangming/work/devel_pipeline/pirna_seq/scripts/test/b.fq.gz
         outdir: /data/yulab/wangming/work/devel_pipeline/pirna_seq/scripts/test/aaaaaa
       smp_name: b
         genome: dm6
        subject: None
        overlap: no
--------------------------------------------------------------------------------
Check the bowtie indexes:
    smRNA_index: ok     /home/wangming/data/genome/dm6/bowtie_index/smRNA
    miRNA_index: ok     /home/wangming/data/genome/dm6/bowtie_index/hairpin
       te_index: ok     /home/wangming/data/genome/dm6/bowtie_index/te
     piRC_index: ok     /home/wangming/data/genome/dm6/bowtie_index/piRNA_cluster
   genome_index: ok     /home/wangming/data/genome/dm6/bowtie_index/dm6
--------------------------------------------------------------------------------
[2021-03-05 10:02:33 INFO] 00.Copy raw data
[2021-03-05 10:02:33 INFO] symlink() skipped, dest ex

In [None]:
# stat
%reload_ext autoreload
%autoreload 2
from qc import *
import qc

x = 'test/aaaaaa/b'
y = x + '/00.raw_data'
df = PiRNApipeStat(x).run()
df

# pd.DataFrame(columns=['A', 'B'])

# s1 = os.path.join('test', 'fx_stat.reads.csv')
# df1 = df.drop(['num_seqs', 'fx_type'], axis=1)
# df1 = df1.pivot_table(columns='u1a10', values='num_reads',
#                index=['sample', 'group'])
# df1.reset_index(inplace=True)
# df1 = df1.fillna(0)
# df1.to_csv(s1, index=False)  