Skip to content

coffeebond/PAL-seq

Repository files navigation

------ Instructions ------

Before running the script, the following parameters need to be set within the python file:

  1. r1_len: length of read1 (use 40 for v3, and 52 for v4). Default: 52.
  2. r2_len: length of read2. Default: 255.
  3. check_pa_tail: whether to use read2 fastq file to filter out reads that don't have a poly(A) tail. Default: False.
  4. dis2T: the number of bases from the sequencing starts in read 2 to the 3' end of the mRNA (use 7 for v3, and 0 for v4). Default: 0.
  5. strand: positive strand for Tail-seq and negative strand for PAL-seq. Default: -.
  6. allow_back: whether to allow HMM to transition from a downstream state back to upstream state. Default: False.
  7. mixed_model: whether to use a Gausian mixed model (if not, a simple Gausian model is used). Default: True.
  8. r1_nor_start: starting position of read1 for signal normalization (use 10 for v3, and 20 for v4). Default: 20.
  9. r1_nor_end: ending position of read1 for signal normalization (use 35 for v3, and 50 for v4). Default: 50.
  10. bound: boundary for normalized log2 T_signal. Default: 5.
  11. all_zero_limit: limit for total number of all zeros in four channels in read 2. Default: 5.
  12. non_T_limit: limit for allowing non-T bases at the very 3' ends when calling tail length, due to possible uridylation. Default: 2.
  13. len_r2_T_filter: length of the begining region in read 2 (after 'dist2T') to check poly(T) (when 'check_pa_tail' is 'True'). Default: 8.
  14. ratio_r2_T_filter: minmal percentage of T in the begining region of read 2 (defined by 'len_r2_T_filter') to check poly(T) (when 'check_pa_tail' is 'True'). Default: 0.7.
  15. training_max: maximal number of clusters used in the training set. Default: 50000.
  16. training_min: minimal number of clusters used in the training set. Default: 5000.
  17. training_ratio: ratio of reads used for training, constrained by "training_max" and "training_min". Default: 0.01.
  18. n_threads: number of cores to use for multiprocess, if too big, memory may fail. Default: 20.
  19. chunk_lines: number of lines to allocate to each core to process, if too big, memory may fail. Default: 10000.
  20. chunk: give a feedback for proceessing this number of lines. Default: 1000000.

This script has three modes:

Mode 1:

  1. a text string which defines the prefix name of the output files (-n, required)
  2. a text string which specifies the output directory (-d, optional)
  3. a text string which indicates what to use for training HMM (-t, optional, default 'all')
  4. a reference file with annotation in bed format (-r, required)
  5. a STAR-aligned bam file (-b, required)
  6. the un-trimmed fastq file from read1 (-f1, required)
  7. the fastq file from read2 (-f2, optional, only if 'check_pa_tail' is 'True')
  8. intensity file from read2 (-i, required)
  9. poly(A) site annotation in bed format (-p, optional)

Mode 2:

  1. a text string which defines the prefix name of the output files (-n, required)
  2. a text string which specifies the output directory (-d, optional)
  3. a text string which indicates what to use for training HMM (-t, optional, default 'all')
  4. T_signal output file generated by this script (-s, required)

Mode 3:

  1. a text string which defines the prefix name of the output files (-n, required)
  2. a text string which specifies the output directory (-d, optional)
  3. T_signal output file generated by this script (-s, required)
  4. HMM model file output from ghmm (-m, required)

Note: When using LSF, make sure to add -n 20 for multiprocessing

It calculates poly(A) tail length based on a Gaussian (mixture) hidden markov model trained on a subset of the data It outputs these files:

  1. all poly(A) tags
  2. median poly(A) tail length and total number of tags for each gene
  3. mean poly(A) tail length and total number of tags for each gene
  4. HMM model
  5. A temporary file containing converted T-signal will be written out on the disk to save memory usage and it can be deleted in the end.
  6. A temporary file containing the HMM states of each read cycle for each cluster, which can be deleted manually.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages