This pipeline generates read counts and metrics from fastq reads. The code was optimized for the infrastructure of the Broad Institute, but it can probably be adapted to other platforms if required.
The pipeline is started by calling the top-level Python script, PipelineMain.py. From the help message:
usage: PipelineMain.py [-h] --config_file CONFIG_FILE [--key_id KEY_ID]
[--key_path KEY_PATH] [--project_ids PROJECT_IDS]
[--seq_id SEQ_ID] [--raw_seq_path RAW_SEQ_PATH]
[--temp_path TEMP_PATH] [--bam_path BAM_PATH]
[--results_path RESULTS_PATH] [--do_patho] [--do_host]
[--remove_splitted] [--gzip_merged]
[--host_aligner HOST_ALIGNER]
[--read_counter READ_COUNTER] [--no_p7] [--use_p5]
[--use_lane] [--use_seq_path] [--no_qsub] [--no_split]
[--no_merge] [--no_align] [--no_count] [--no_umi_count]
[--no_metrics] [--no_umi_metrics]
[--no_replace_refname] [--no_expand] [--no_bc_split]
[--no_ref] [--Suffix_s1 SUFFIX_S1]
[--Suffix_s2 SUFFIX_S2] [--Suffix_ne SUFFIX_NE]
[--ADD5 ADD5] [--ADD3 ADD3] [--MOC_id MOC_ID]
[--trim_rs_5p TRIM_RS_5P] [--trim_rs_3p TRIM_RS_3P]
[--keep_rs_5p KEEP_RS_5P] [--keep_rs_3p KEEP_RS_3P]
[--trim_r1_5p TRIM_R1_5P] [--trim_r1_3p TRIM_R1_3P]
[--trim_r2_5p TRIM_R2_5P] [--trim_r2_3p TRIM_R2_3P]
[--keep_r1_5p KEEP_R1_5P] [--keep_r1_3p KEEP_R1_3P]
[--keep_r2_5p KEEP_R2_5P] [--keep_r2_3p KEEP_R2_3P]
[--MOC_id_ref MOC_ID_REF] [--AllSeq_con_A ALLSEQ_CON_A]
[--AllSeq_trim_len ALLSEQ_TRIM_LEN] [--no_login_name]
[--min_resource] [--count_strand_rev COUNT_STRAND_REV]
[--do_bestacc] [--use_sample_id] [--bwa_mem]
[--rm_rts_dup] [--do_rerun] [--paired_only_patho]
[--paired_only_host] [--ucore_time UCORE_TIME]
Process the options.
optional arguments:
-h, --help show this help message and exit
--config_file CONFIG_FILE, -c CONFIG_FILE
Path to main config file
--key_id KEY_ID Key file would be "key_id"_key.txt
--key_path KEY_PATH Key file path (absolute)
--project_ids PROJECT_IDS
Project ids
--seq_id SEQ_ID Sequencing id used to make raw_seq_path
--raw_seq_path RAW_SEQ_PATH
Directory for raw sequence files (absolute)
--temp_path TEMP_PATH
Will contain the temporary results
--bam_path BAM_PATH Will contain the aligned sorted bam files
--results_path RESULTS_PATH
Will contain the path to the results
--do_patho Do the patho
--do_host Do the host
--remove_splitted Remove the splitted files
--gzip_merged Compress the merged files to gzip format
--host_aligner HOST_ALIGNER
Aligner for host (BBMap)
--read_counter READ_COUNTER
Tool for counting reads (default: Counter written by
JLivny)
--no_p7 Use if P7 index is not used.
--use_p5 Use if P5 index is used.
--use_lane Use if lane specific merging is required.
--use_seq_path Use if direct mapping of sample to raw seq path is
required.
--no_qsub Does not submit qsub jobs.
--no_split Does not split the fastq files.
--no_merge Does not merge the split fastq files.
--no_align Does not align.
--no_count Does not count.
--no_umi_count Does not collapse/compute UMI.
--no_metrics No metrics generation.
--no_umi_metrics Does not compute UMI metrics.
--no_replace_refname Does not replace reference name in fna
--no_expand Does not expand p7 and barcode entries in the key file
if required.
--no_bc_split Does not run the barcode splitter, instead create
softlinks of the raw seq files to the split directory.
--no_ref Does not generate patho ref.
--Suffix_s1 SUFFIX_S1
Update the value of Suffix_s1
--Suffix_s2 SUFFIX_S2
Update the value of Suffix_s2
--Suffix_ne SUFFIX_NE
Update the value of Suffix_ne
--ADD5 ADD5 ADD5 for gff parser
--ADD3 ADD3 ADD3 for gff parser
--MOC_id MOC_ID Provide MOC string for adding MOC hierarchy
--trim_rs_5p TRIM_RS_5P
5p trim count for read-single
--trim_rs_3p TRIM_RS_3P
3p trim count for read-single
--keep_rs_5p KEEP_RS_5P
5p keep count for read-single
--keep_rs_3p KEEP_RS_3P
3p keep count for read-single
--trim_r1_5p TRIM_R1_5P
5p trim count for read1
--trim_r1_3p TRIM_R1_3P
3p trim count for read1
--trim_r2_5p TRIM_R2_5P
5p trim count for read2
--trim_r2_3p TRIM_R2_3P
3p trim count for read2
--keep_r1_5p KEEP_R1_5P
5p keep count for read1
--keep_r1_3p KEEP_R1_3P
3p keep count for read1
--keep_r2_5p KEEP_R2_5P
5p keep count for read2
--keep_r2_3p KEEP_R2_3P
3p keep count for read2
--MOC_id_ref MOC_ID_REF
Adds MOC_id_ref to Bacterial_Ref_path
--AllSeq_con_A ALLSEQ_CON_A
Minimum number of consecutive A required to trim
AllSeq read.
--AllSeq_trim_len ALLSEQ_TRIM_LEN
Minimum length of a read required to keep it after
AllSeq trim
--no_login_name Generate results in a username specific directory
--min_resource Request for minimum resource during unicore allocation
--count_strand_rev COUNT_STRAND_REV
Reverse the counting mechanism by making it "forward"
--do_bestacc Run bestacc after splitting and merging
--use_sample_id Use sample id for sample id mapping
--bwa_mem Use bwa mem instead of bwa backtrack
--rm_rts_dup Remove PCR duplicates from RNA-tagseq
--do_rerun Rerun pipeline after a previous incomplete run
--paired_only_patho Align/count only the reads when both ends are mapped
on pathogen side
--paired_only_host Align/count only the reads when both ends are mapped
on host side
--ucore_time UCORE_TIME
Timelimit of unicore for this run
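As a minimal sketch of how a run might be launched from another Python script, the helper below assembles a PipelineMain.py command line using only options that appear in the help message above. The helper name build_pipeline_cmd and every path and id in the example are hypothetical placeholders; which options a real run needs depends on your config file and key file.

```python
import shlex


def build_pipeline_cmd(config_file, key_id, project_ids, do_patho=True,
                       do_host=False, no_qsub=False,
                       script="PipelineMain.py"):
    """Assemble an argument list for PipelineMain.py (hypothetical helper)."""
    cmd = [script, "--config_file", config_file,
           "--key_id", key_id, "--project_ids", project_ids]
    # Boolean switches are only appended when requested.
    if do_patho:
        cmd.append("--do_patho")
    if do_host:
        cmd.append("--do_host")
    if no_qsub:
        cmd.append("--no_qsub")
    return cmd


# Placeholder values; replace with your own config file, key id and project id.
cmd = build_pipeline_cmd("/path/to/config_file", "MyKey", "Proj1", no_qsub=True)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the placeholders point at real files.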
Here are brief descriptions of other directories of the pipeline.
Directory "other"
---------------------------------
The "other" folder contains several components of the pipeline.
BarcodeSplitter subfolder
...........................
(1) BarcodeSplitter: This removes the inline barcodes from the reads (both RNATagSeq and AllSeq protocols) and bins the reads into separate files, so that each binned file contains only reads with the same inline barcode.
To execute the barcode splitter for RNA-tagseq:
bc_splitter_rts -d <dict_file> --file1 <file1> --file2 <file2> -p <prefix_str> -o <outdir>
where,
dict_file - the dictionary file
file1 - first fastq file (first of the pair, read 1)
file2 - second fastq file (second of the pair, read 2)
prefix_str - prefix string prepended to all output files
outdir - output directory
The help message lists the parameters of the executable, including the optional ones:
-h [ --help ] produce help message
-d [ --dict-file ] arg Dictionary file
--file1 arg First file
--file2 arg Second file
-p [ --prefix ] arg Prefix string
-o [ --outdir ] arg Output directory
-k [ --keep_last ] Optional/Do use last base of barcode (RNATag-Seq)
--ha Optional/Highlight all barcodes
--bc-all arg Optional/File of all barcodes
--bc-used arg Optional/File of used barcodes, one number per
line
-m [ --mismatch ] arg (=1) Optional/Maximum allowed mismatches.
--allowed-mb arg (=2048) Optional/Estimated memory requirement in MB.
(2) To execute the barcode splitter for the AllSeq protocol, use the executable bc_splitter. Its mandatory parameters are the same as those of bc_splitter_rts, but its optional parameters differ because the start position and length of the UMI and barcode are configurable. From the help message of bc_splitter:
./bc_splitter -h
-h [ --help ] produce help message
-d [ --dict-file ] arg Dictionary file
--file1 arg First file
--file2 arg Second file
-p [ --prefix ] arg Prefix string
-o [ --outdir ] arg Output directory
-t [ --type ] arg (=allseq) Optional allseq/rnatagseq
--ha Optional/Highlight all barcodes
--bc-all arg Optional/File of all barcodes
--bc-used arg Optional/File of used barcodes, one number per
line
-m [ --mismatch ] arg (=1) Optional/Maximum allowed mismatches.
--bc-start arg (=6) Optional/Barcode start position
--bc-size arg (=6) Optional/Barcode size
--umi-start arg (=0) Optional/Umi start position
--umi-size arg (=6) Optional/Umi size
--allowed-mb arg (=2048) Optional/Estimated memory requirement in MB.
Usage: bc_splitter -d <dict_file> --file1 <file1> --file2 <file2> -p <prefix_str> -o <outdir>
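To illustrate what the positional options mean, here is a small Python sketch. It is not the splitter's actual code. It assumes that the start positions are 0-based offsets into the read that carries the inline barcode, and that a read is assigned to a barcode only when exactly one barcode lies within the allowed number of mismatches (Hamming distance). The barcode names and sequences are invented for the example; the default values are taken from the help message above.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign_barcode(read_seq, barcodes, bc_start=6, bc_size=6,
                   umi_start=0, umi_size=6, max_mismatch=1):
    """Return (barcode_name, umi); barcode_name is None without a unique match.

    Assumption: positions are 0-based offsets into the read.
    """
    umi = read_seq[umi_start:umi_start + umi_size]
    observed = read_seq[bc_start:bc_start + bc_size]
    hits = [name for name, seq in barcodes.items()
            if len(seq) == len(observed)
            and hamming(seq, observed) <= max_mismatch]
    name = hits[0] if len(hits) == 1 else None
    return name, umi


# Invented barcodes and read: 6-base UMI, 6-base barcode, then the insert.
barcodes = {"bc01": "ACGTAC", "bc02": "TTGGCC"}
read = "AAACCC" + "ACGTAA" + "GGGGTTTTCCCC"
print(assign_barcode(read, barcodes))
```

In this example the observed barcode ACGTAA differs from bc01 at one position, so the read is assigned to bc01 under the default mismatch limit of 1.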
(3) index_splitter removes the P7 index barcode from fastq reads. It is run through the Python script IndexSplitterMain.py:
IndexSplitterMain.py -d <dict_file> -i <input_dir> -o <output_dir>
where,
dict_file - dictionary file containing P7 indices
input_dir - directory containing the input fastq files
output_dir - directory for the results of the run
Sub-directory "barcodes"
--------------------------------------
Contains barcodes for the pipeline.
Nextera_i5_indices.txt - list of i5 barcodes used for dual index splitter
Nextera_i7_indices.txt - list of i7 barcodes used for dual index splitter
allseq_bcs.txt - list of AllSeq inline barcodes
p7_barcodes.txt - list of P7 barcodes used for index splitter
rts_bcs.txt - inline barcodes for RNA-tagseq
Sub-directory "configs"
---------------------------
Contains a sample configuration file used to set environment variables before the pipeline is submitted.
Sub-directory "dual_indexed_bc"
-------------------------------
The dual index barcode splitter removes both i7 and i5 barcodes from the reads. It can be executed by calling the top-level Python script, for example:
dual_indexed_bc/IndexSplitterMain.py -i <input dir> -o <output dir> --dict1_file <dictionary file i7> --dict2_file <dictionary file i5>
From the help message,
usage: IndexSplitterMain.py [-h] --indir INDIR --outdir OUTDIR --dict1_file
DICT1FILE --dict2_file DICT2FILE
[--suffix_s1 SUFFIX_S1] [--suffix_s2 SUFFIX_S2]
[-m MEMORY] [--no_qsub] [--total_run TOTAL_RUN]
[--run_time_ind RUN_TIME_IND]
Process inputs for index splitter
optional arguments:
-h, --help show this help message and exit
--indir INDIR, -i INDIR
Input directory (default: None)
--outdir OUTDIR, -o OUTDIR
Output directory (default: None)
--dict1_file DICT1FILE
index 1 dict file (default: None)
--dict2_file DICT2FILE
index 2 dict file (default: None)
--suffix_s1 SUFFIX_S1
--suffix_s2 SUFFIX_S2
-m MEMORY, --memory MEMORY
Amount of memory to allocate (default: 8000)
--no_qsub Does not submit qsub jobs. (default: True)
--total_run TOTAL_RUN
Run for top read_count number of reads (if -1, run for
all reads) (default: -1)
--run_time_ind RUN_TIME_IND
Run time for individual jobs (default: 24 hours)
(default: 24)
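The idea of dual index splitting can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the splitter's implementation: the two dictionaries are shown as in-memory mappings from index sequence to index name (the real dictionary file format is not described here), matching is exact, and all sequences and names are invented. The total_run argument mirrors the documented behaviour of --total_run, where -1 means all reads.

```python
from collections import Counter
from itertools import islice


def bin_by_index_pair(index_pairs, i7_dict, i5_dict, total_run=-1):
    """Count reads per (i7 name, i5 name) bin.

    index_pairs: iterable of (i7_seq, i5_seq) tuples, one per read.
    total_run:   process only the first N reads; -1 means all reads.
    """
    if total_run >= 0:
        index_pairs = islice(index_pairs, total_run)
    counts = Counter()
    for i7_seq, i5_seq in index_pairs:
        i7_name = i7_dict.get(i7_seq)
        i5_name = i5_dict.get(i5_seq)
        # A read is binned only when both of its indices are recognised.
        if i7_name is None or i5_name is None:
            counts["unassigned"] += 1
        else:
            counts[(i7_name, i5_name)] += 1
    return counts


# Invented index sequences and names, for illustration only.
i7_dict = {"AAAACCCC": "i7_A", "ACGTACGT": "i7_B"}
i5_dict = {"GGGGTTTT": "i5_A"}
reads = [("AAAACCCC", "GGGGTTTT"), ("AAAACCCC", "GGGGTTTT"),
         ("ACGTACGT", "GGGGTTTT"), ("AAAACCCC", "NNNNNNNN")]
print(bin_by_index_pair(reads, i7_dict, i5_dict))
```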
Sub-directory "read_trimmer"
-------------------------------
It contains a script called "read_trimmer", used to trim fastq reads at the 5-prime and 3-prime ends.
Usage: read_trimmer -i <infile> -o <outfile> --logfile <logfile> (optional) --trim_5p <number of bases to trim on 5p side> --trim_3p <number of bases to trim on 3p side> --keep_5p <number of bases to keep from 5p side> --keep_3p <number of bases to keep from 3p side>
From the help message,
read_trimmer -h
-h [ --help ] generate help message
-i [ --infile ] arg path to the input file
-o [ --outfile ] arg path to the output file
- [ --logfile ] arg Path to the logfile
--trim_5p arg (=0) number of bases to trim from 5p
--trim_3p arg (=0) number of bases to trim from 3p
--keep_5p arg (=-1) number of bases to keep from the 5p
--keep_3p arg (=-1) number of bases to keep from the 3p
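The following Python sketch shows one plausible reading of the four options on a single read sequence. It is an assumption-laden illustration rather than read_trimmer's code: the trim options are applied first and the keep options afterwards, and a keep value of -1 (the default) means the option is not set. The function name trim_read is hypothetical.

```python
def trim_read(seq, trim_5p=0, trim_3p=0, keep_5p=-1, keep_3p=-1):
    """Trim a read sequence; a keep value of -1 means 'not set'.

    Assumption: trimming is applied before the keep options.
    """
    end = len(seq) - trim_3p
    # Remove bases from both ends; an over-trimmed read becomes empty.
    seq = seq[trim_5p:end] if end > trim_5p else ""
    if keep_5p >= 0:
        seq = seq[:keep_5p]                 # keep the first keep_5p bases
    if 0 <= keep_3p < len(seq):
        seq = seq[len(seq) - keep_3p:]      # keep the last keep_3p bases
    return seq


read = "ACGTACGTAC"
print(trim_read(read, trim_5p=2, trim_3p=3))
print(trim_read(read, keep_5p=4))
print(trim_read(read, keep_3p=4))
```

In a fastq record the quality string has to be cut with the same offsets as the sequence so that the two stay the same length.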