------ Instructions ------
Before running the script, the following parameters need to be set within the python file:
- r1_len: length of read1 (use 40 for v3, and 52 for v4). Default: 52.
- r2_len: length of read2. Default: 255.
- check_pa_tail: whether to use read2 fastq file to filter out reads that don't have a poly(A) tail. Default: False.
- dis2T: the number of bases from the sequencing starts in read 2 to the 3' end of the mRNA (use 7 for v3, and 0 for v4). Default: 0.
- strand: positive strand for Tail-seq and negative strand for PAL-seq. Default: -.
- allow_back: whether to allow HMM to transition from a downstream state back to upstream state. Default: False.
- mixed_model: whether to use a Gausian mixed model (if not, a simple Gausian model is used). Default: True.
- r1_nor_start: starting position of read1 for signal normalization (use 10 for v3, and 20 for v4). Default: 20.
- r1_nor_end: ending position of read1 for signal normalization (use 35 for v3, and 50 for v4). Default: 50.
- bound: boundary for normalized log2 T_signal. Default: 5.
- all_zero_limit: limit for total number of all zeros in four channels in read 2. Default: 5.
- non_T_limit: limit for allowing non-T bases at the very 3' ends when calling tail length, due to possible uridylation. Default: 2.
- len_r2_T_filter: length of the begining region in read 2 (after 'dist2T') to check poly(T) (when 'check_pa_tail' is 'True'). Default: 8.
- ratio_r2_T_filter: minmal percentage of T in the begining region of read 2 (defined by 'len_r2_T_filter') to check poly(T) (when 'check_pa_tail' is 'True'). Default: 0.7.
- training_max: maximal number of clusters used in the training set. Default: 50000.
- training_min: minimal number of clusters used in the training set. Default: 5000.
- training_ratio: ratio of reads used for training, constrained by "training_max" and "training_min". Default: 0.01.
- n_threads: number of cores to use for multiprocess, if too big, memory may fail. Default: 20.
- chunk_lines: number of lines to allocate to each core to process, if too big, memory may fail. Default: 10000.
- chunk: give a feedback for proceessing this number of lines. Default: 1000000.
This script has three modes:
Mode 1:
- a text string which defines the prefix name of the output files (-n, required)
- a text string which specifies the output directory (-d, optional)
- a text string which indicates what to use for training HMM (-t, optional, default 'all')
- a reference file with annotation in bed format (-r, required)
- a STAR-aligned bam file (-b, required)
- the un-trimmed fastq file from read1 (-f1, required)
- the fastq file from read2 (-f2, optional, only if 'check_pa_tail' is 'True')
- intensity file from read2 (-i, required)
- poly(A) site annotation in bed format (-p, optional)
Mode 2:
- a text string which defines the prefix name of the output files (-n, required)
- a text string which specifies the output directory (-d, optional)
- a text string which indicates what to use for training HMM (-t, optional, default 'all')
- T_signal output file generated by this script (-s, required)
Mode 3:
- a text string which defines the prefix name of the output files (-n, required)
- a text string which specifies the output directory (-d, optional)
- T_signal output file generated by this script (-s, required)
- HMM model file output from ghmm (-m, required)
Note: When using LSF, make sure to add -n 20 for multiprocessing
It calculates poly(A) tail length based on a Gaussian (mixture) hidden markov model trained on a subset of the data It outputs these files:
- all poly(A) tags
- median poly(A) tail length and total number of tags for each gene
- mean poly(A) tail length and total number of tags for each gene
- HMM model
- A temporary file containing converted T-signal will be written out on the disk to save memory usage and it can be deleted in the end.
- A temporary file containing the HMM states of each read cycle for each cluster, which can be deleted manually.