- `lg3 test validate` now compares the MD5 checksums of the content of the trimmed FASTQ files, the BWA-aligned BAM files, and the corresponding BAM index files. This requires that MD5 checksum files can be written.
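As an illustration of the content comparison described above, here is a minimal sketch and not the code used by `lg3 test validate`. The helper name `same_md5` and the demo files are hypothetical, and GNU `md5sum` is assumed to be available. Unlike the pipeline, this sketch does not write checksum files.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the pipeline): succeed if two files have
# identical content according to their MD5 checksums.
same_md5() {
  local a b
  [[ -r "$1" && -r "$2" ]] || return 2   # both files must be readable
  a=$(md5sum < "$1" | cut -d ' ' -f 1)   # keep only the checksum field
  b=$(md5sum < "$2" | cut -d ' ' -f 1)
  [[ "$a" == "$b" ]]
}

# Demo on throwaway files standing in for FASTQ/BAM files
tmpdir=$(mktemp -d)
printf 'ACGT\n' > "$tmpdir/a.txt"
printf 'ACGT\n' > "$tmpdir/b.txt"
printf 'TTTT\n' > "$tmpdir/c.txt"
same_md5 "$tmpdir/a.txt" "$tmpdir/b.txt" && echo "a and b: identical content"
same_md5 "$tmpdir/a.txt" "$tmpdir/c.txt" || echo "a and c: content differs"
```

Comparing checksums of the content rather than of the file paths or timestamps means that two runs count as reproducible whenever they produce byte-identical data.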
- The 'java', 'python' and 'Rscript' executables are now set in a central location to guarantee that the same, expected versions are used everywhere.
- The AnnoVar, bedtools, cutadapt, BWA, GATK, MuTect, Picard, and Samtools executables are now set in a central location to guarantee that the same, expected versions are used everywhere.
- The LG3 Pipeline requires Python 2. If an incompatible Python version is detected on the 'PATH', then an informative error is produced.
- The 'lg3.conf' file must not be edited while the pipeline is running. If done, then the outcome and the results are unpredictable. This is because the 'lg3.conf' file is not frozen when the pipeline is launched.
- The LG3 Pipeline now refuses to run from within its installation folder, i.e. when the current working directory equals '${LG3_HOME}'. This protects against various potential mistakes such as overriding installed files and settings.
- Global settings for the LG3 Pipeline, such as locations of software tools, are now configured in the '${LG3_HOME}/lg3.conf' bash script (which should not be edited by the user). If a file 'lg3.conf' exists in the current working directory ("the project folder"), then that file is sourced after '${LG3_HOME}/lg3.conf', which makes it possible to override some or all of the predefined global settings on a project-by-project basis. The latter file can also be used to configure variables such as PATIENT.
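The sourcing order described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's own code: the function name `load_lg3_conf` and the settings `DEMO_TOOL` and `DEMO_THREADS` are hypothetical, and temporary folders stand in for `${LG3_HOME}` and the project folder.

```bash
#!/usr/bin/env bash
# Sketch only: load global settings first, then optional project overrides.
# 'load_lg3_conf' is a hypothetical name, not a function from the pipeline.
load_lg3_conf() {
  local home="$1" project_dir="$2"
  # Global settings shipped with the installation
  source "${home}/lg3.conf" || return 1
  # Project-specific settings, if present, are sourced last and therefore win
  if [[ -f "${project_dir}/lg3.conf" ]]; then
    source "${project_dir}/lg3.conf" || return 1
  fi
}

# Demo with throwaway folders standing in for ${LG3_HOME} and the project folder
demo_home=$(mktemp -d)
demo_project=$(mktemp -d)
echo 'DEMO_TOOL=/opt/global/tool; DEMO_THREADS=4' > "${demo_home}/lg3.conf"
echo 'DEMO_THREADS=12; PATIENT=Patient157t10' > "${demo_project}/lg3.conf"
load_lg3_conf "${demo_home}" "${demo_project}"
echo "DEMO_TOOL=${DEMO_TOOL} DEMO_THREADS=${DEMO_THREADS} PATIENT=${PATIENT}"
```

Because the project file is sourced second, any variable it assigns replaces the global value, while settings it does not mention keep their global defaults.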
- Now 'lg3 status' defaults to using '--all'.
- ROBUSTNESS: Now `bin/lg3-test` explicitly asserts that Rscript exists before attempting to use it.
- Added option to use new Pindel 0.2.5b8 instead of Pindel 0.2.4t.
- Added '_run_Align_mem' for an alternative data pre-processing compliant with Best Practices 2019.
- Added '_run_Mutect2' for somatic mutation calling by Mutect2 and GATK 4.1.
- Added '_run_Align_gz_no_trim' to align without trimming FASTQ data.
- Added '_run_PSCN' for running the PSCN pipeline directly within the LG3 pipeline.
- Paths to resources and some other parameters are now in config file 'lg3.conf' (work in progress).
- More informative and consistent error messages are provided in more places by making more use of internal utility functions such as 'error', 'warn', 'assert_file_exists', 'assert_directory_exists', 'make_dir' and 'change_dir'.
- Recal_pass2.pbs failed to create symbolic link(s).
- The exomeQualityPlots pipeline is now integrated with the main pipeline.
- Added stand-alone Germline analysis, exactly the same as in the Recal step, which can be used in case the Recal-Germline step fails.
- Fixed a wrong file extension in Recal_pass2.sh.
- Errors produced by the pipeline itself now also output traceback information showing the function, line number, and script pathname of each entry in the call stack.
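A traceback of this kind can be built from Bash's built-in `FUNCNAME`, `BASH_LINENO`, and `BASH_SOURCE` arrays. The following is a minimal sketch, not the pipeline's implementation; the function names `print_traceback`, `inner`, and `outer` are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch of a traceback printer; 'print_traceback' is a hypothetical name.
print_traceback() {
  local i
  echo "Traceback:"
  # Frame 0 is print_traceback itself, so start at frame 1.
  # BASH_LINENO[i-1] is the line in frame i from which frame i-1 was called.
  for (( i = 1; i < ${#FUNCNAME[@]}; i++ )); do
    echo "  ${i}: ${FUNCNAME[i]}() at ${BASH_SOURCE[i]}:${BASH_LINENO[i-1]}"
  done
}

# Demo: a two-level call chain
inner() { print_traceback; }
outer() { inner; }
trace=$(outer)
echo "$trace"
```

The exact line numbers and script path in the output depend on where the code is saved and run.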
- Using more informative names on variables used for script filenames.
- Earlier detection of errors by asserting that expected output files are produced after each internal call of the pipeline finishes.
- All scripts now clean up their scratch space at the end of the run.
- Now `./_run_MutDet` reports on the `NORMAL`, `TUMOR`, and `TYPE` inferred from the `CONV` file and the `PATIENT` name, and asserts that such entries actually exist in the `CONV` file.
- It appears not to be possible to quote `INPUT` filenames for Picard, i.e. we cannot use `INPUT="<file>"` but have to stick with `INPUT=<file>`. This means that those input file names must not have spaces. GATK has the same limitation on its `-I <file>` option.
- The (optional) `_run_Merge` step would produce error: "scripts/Merge.sh: line 6: PROJECT: parameter null or not set".
- Run scripts `_run_MutDet`, `_run_Merge`, and `_run_Merge_QC` would fail if previous step used a `PROJECT` other than the default 'LG3'.
- Pipeline would not support tab-delimited patient files with Microsoft Windows-style line endings, i.e. CRLF (`\r\n`) line endings.
- `scripts/chk_mutdet.sh` did not acknowledge environment variable 'CONV'.
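To illustrate the CRLF issue mentioned above, here is a minimal sketch of one common way to read a tab-delimited file so that both LF and CRLF line endings work. It is not taken from the pipeline, and the two-column layout is made up for the demo; it is not the format of the `CONV` file.

```bash
#!/usr/bin/env bash
# Sketch: read a tab-delimited file robustly, whether lines end in LF or CRLF.
# The two-column layout below is made up for the demo, not the CONV format.
demo_tsv=$(mktemp)
printf 'sample\ttype\r\nZ00599\tNormal\r\nZ00600\tTumor\r\n' > "$demo_tsv"

types=()
while IFS=$'\t' read -r sample type; do
  type=${type%$'\r'}                        # drop a trailing carriage return, if any
  [[ "$sample" == "sample" ]] && continue   # skip the header row
  types+=("$type")
done < "$demo_tsv"
printf 'type=[%s]\n' "${types[@]}"
```

Without the `${type%$'\r'}` step, the last field of every row would carry an invisible carriage return, so string comparisons against values such as `Normal` would fail.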
- Environment variable `LG3_HOME` must now be set. If not set, an error is produced. It used to default to a Costello Lab specific location on the TIPCC cluster.
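A check for a required environment variable, as described above, might look like the following sketch. The helper name `assert_env` and the variable `DEMO_UNSET` are hypothetical; the pipeline's actual check may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: fail with a message if the named variable is unset or empty.
assert_env() {
  local name="$1"
  # ${!name} is Bash indirect expansion: the value of the variable named by $name
  if [[ -z "${!name:-}" ]]; then
    echo "ERROR: Environment variable '${name}' is not set" >&2
    return 1
  fi
}

# Demo
LG3_HOME=/path/to/LG3_Pipeline
assert_env LG3_HOME && echo "LG3_HOME is set"
unset DEMO_UNSET
assert_env DEMO_UNSET 2> /dev/null || echo "DEMO_UNSET is missing"
```

In a real script, the caller would typically exit with a non-zero status when the check fails, so that the job stops before doing any work.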
- Alignment jobs now require less memory by default (64 GiB RAM instead of 100 GiB), which should decrease the average default queuing time.
- Added `lg3 envir` for displaying current environment variables related to the LG3 Pipeline.
- Added `lg3 --news` for displaying the NEWS.md file in the terminal.
- Patient IDs must not contain underscores (`_`) because the Pindel step of the pipeline does not support that. All steps of the pipeline now assert that patient IDs do not contain underscores.
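The underscore assertion can be expressed with a simple pattern match. This is a sketch only; the function name `assert_patient_id` is hypothetical and the pipeline's own assertion may be worded differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reject patient IDs that contain an underscore.
assert_patient_id() {
  local id="$1"
  if [[ "$id" == *_* ]]; then
    echo "ERROR: Patient ID must not contain underscores: '${id}'" >&2
    return 1
  fi
}

# Demo with two names that appear in this changelog
for id in Patient157t10 Patient157_t10_underscore; do
  if assert_patient_id "$id" 2> /dev/null; then
    echo "${id}: ok"
  else
    echo "${id}: rejected"
  fi
done
```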
- PBS scripts would only run on TIPCC compute nodes that support the legacy PBS `bigmem` flag. By removing this unnecessary `bigmem` requirement from all PBS scripts, jobs can now run on all compute nodes that meet the core and memory requirements specified by each PBS script (or as overridden in the LG3 call).
- Run scripts now infer `SAMPLES` and `NORMAL` from the patient file (`CONV`) given the `PATIENT` name. It is no longer necessary to set environment variables `SAMPLES` and `NORMAL` when running the pipeline. These variables will become deprecated soon and later produce an error if specified.
- Add section on 'Contributors' to the README.
- HARMONIZATION: Standardizing variable names throughout all scripts.
- TESTS: Tests now default to using Patient157t10.
- TESTS: Added test set 'Patient157_t10_underscore' (sic!) containing FASTQ files with additional underscores and `_R1`/`_R2` suffixes in their names. This test set is just a renamed copy of the existing 'Patient157t10' set.
- The pipeline did not support FASTQ file names with underscores (`_`) other than the ones indicating paired-end reads, `_R1` and `_R2`. File names with a suffix between `_R1`/`_R2` and `.fastq.gz` were also not supported. Note that trimming drops any `_R1/_R2` suffixes, e.g. trimming a FASTQ file `Z00600_t10_AATCCGTC_L007_R1_001_HQ_paired.fastq.gz` produces a trimmed FASTQ file `Z00601_t10_AATCCGTC_L007-trim_R1.fastq.gz`.
- Some run scripts (`_run_MutDet`), job scripts (`Recal_bigmem.pbs`, `MutDet_TvsN.pbs`, and `UG.pbs`), and scripts (`scripts/chk_mutdet.sh` and `scripts/chk_pindel.sh`) did not catch errors and quit with exit code 1.
- `lg3 test setup` incorrectly reported that the CONV file does not exist.
- The recalibration steps (Recal and Recal_pass2) now specify the region list when calling GATK's RealignerTargetCreator to create the intervals used for indel detection. This significantly speeds up these recalibration steps, e.g. recalibration of Patient157t (two chromosomes) went down from ~14-15 hours to ~6 hours. In addition, the power to detect mutations should improve by specifying regions because we will not waste power on testing for mutations outside these regions. Note that the set of mutations identified will change slightly because of this, i.e. although the differences should be few, already processed samples should ideally be reprocessed.
- The default output folder is now 'output/' in the current working directory. It used to be a folder specific to the Costello Lab, which could be overridden by setting `LG3_OUTPUT_ROOT`. There is no longer a need to set this environment variable, which will soon be deprecated together with the LG3_INPUT_ROOT environment variable.
- Now 'lg3 test setup' also installs required R packages, if missing.
- Now `lg3 test validate` also supports the new Patient157t10 data set.
- ROBUSTNESS: Now `_run_Recal` and `_run_Recal_pass2` assert that `NORMAL` is part of the specified `SAMPLES` set.
- Harmonized the names of the *.out and *.err log files produced by the run scripts.
- `lg3 status` now uses boolean flags instead of options with boolean values.
- Scripts now report on the hostname to help any troubleshooting.
- The last run script, `_run_PostMut`, did not acknowledge the environment variable `PROJECT` in one of its parts, where it instead used a hardcoded `LG3` value.
- `lg3 test validate` failed if `PROJECT` was not the default value ('LG3').
- `FilterMutations/filter.profile.sh` added `/home/jocostello/shared/LG3_Pipeline` to the `PYTHONPATH` instead of `${LG3_HOME}`.
- Added `_run_Recal_pass2` for recalibrating merged BAM files.
- Now the default input folder for trimming and alignment is rawdata/. It used to be an absolute path specific to the Costello lab storage.
- Chastity filtering (prior to alignment of FASTQ files) is now disabled by default. To enable, set environment variable `LG3_CHASTITY_FILTERING=true`.
- The default output folder to which trimmed FASTQ files are written is now defined by the `LG3_OUTPUT_ROOT` environment variable (default is output/) rather than the folder of the raw FASTQ files.
- Add bin/ folder with `lg3` command. Currently, it implements `lg3 status` and `lg3 test`. `lg3 status` is used for checking the output of the different stages in the pipeline. `lg3 test setup` is used to set up the test example. `lg3 test validate` is used to validate the results of the test example against a reference currently stored on the TIPCC file system.
- More scripts now take environment variable `PROJECT` (defaults to `LG3`) as an optional input to control the subfolder of the output data.
- If the optional _run_Recal_pass2 step is run, which occurs after recalibration and merging, it will rename the existing exome_recal/$PATIENT/ subfolder to exome_recal/$PATIENT.before.merge/ such that the final output is always in exome_recal/$PATIENT/ regardless of merging or not.
- Now using extension *.tsv for patient_ID_conversions.tsv (was *.txt) to clarify that it is a tab-delimited file.
- README now includes instructions on how to check the progress (`lg3 status`) and the reproducibility of the test example (`lg3 test setup` and `lg3 test validate`).
- HARMONIZATION: Using `PROJECT` everywhere; previously `PROJ` was also used.
- _run_Align_gz failed to detect already processed samples (due to a typo).
- _run_Align_gz used the wrong default input folder: it looked for the trimmed FASTQ files in rawdata/ rather than output/.
- Scratch folders were not job specific for most TIPCC users.
- Environment variable `EMAIL` must now be set in order to run any of the steps in the pipeline; if not set, an informative error is produced. Set it to the email address where you wish the scheduler to send job reports, e.g. `export EMAIL=alice@example.org`.
- Renamed the optional environment variable `CHASTITY_FILTERING` to `LG3_CHASTITY_FILTERING`.
- Added run scripts `_run_Merge` and `_run_Merge_QC` for merging recalibrated, replicated BAM files that are for the same sample.
- Environment variable `LG3_INPUT_ROOT` is now optional. If not specified, it will be set to a sensible default depending on `LG3_OUTPUT_ROOT`.
- In order to minimize the risk of clashes, scratch folders are now both user and job specific; they used to be only user specific.
- Giving more informative error messages in case files are missing.
- Updated README with details on how to run the pipeline on the example test data and from any location.
- Mention `module load CBC lg3` for TIPCC users.
- Run script `_run_Pindel` assumed that the resources/ folder was in the working directory rather than in the `LG3_HOME` directory.
- A job that was allocated 12 cores by the scheduler would only use 2 cores, because the first digit was dropped due to a Bash typo. This bug was introduced in the previous version.
- The pipeline can now be run by any user on the TIPCC compute cluster by setting environment variables `LG3_HOME`, `LG3_OUTPUT_ROOT`, and `LG3_INPUT_ROOT`. If not set, the default is to use the hardcoded folders used in previous versions of the pipeline.
- The location of the LG3 Pipeline folder can now be set via environment variable `LG3_HOME`, e.g. `export LG3_HOME=/path/to/LG3_Pipeline`.
- The location where result files are written can be set via environment variable `LG3_OUTPUT_PATH`. For example, to output to the folder `output/` in the current directory, use `export LG3_OUTPUT_PATH=output`. The folder will be created, if missing.
- The location of output files from previous steps in the pipeline can be set via environment variable `LG3_INPUT_PATH`, which should typically be set to the same folder as `LG3_OUTPUT_PATH`, i.e. `export LG3_INPUT_PATH=${LG3_OUTPUT_PATH}`. The folder will be created, if missing.
- Environment variable `EMAIL` can be used to set the email address to which the Torque/PBS scheduler will send email reports when jobs finish.
- Chastity filtering (prior to alignment of FASTQ files) is now optional by setting environment variable `CHASTITY_FILTERING` (default is true).
- Generalized the `run_demo/_run_*` scripts to make it easier to reuse them for other samples.
- Most scripts now respect the number of cores assigned to them (`PBS_NUM_PPN`) by the Torque/PBS scheduler. This makes it easier to increase the amount of parallelization used. It also lowers the risk of mistakenly using more cores than assigned.
- That all required input files exist is now asserted as soon as possible and in all steps in order to detect (user or coding) mistakes as early as possible, which helps troubleshooting.
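Respecting the scheduler's core allocation, as described in the `PBS_NUM_PPN` item above, can be sketched as below. The function name `n_cores` and the fallback to a single core are assumptions made for this illustration, not the pipeline's documented behavior.

```bash
#!/usr/bin/env bash
# Hypothetical helper: number of cores to use, taken from PBS_NUM_PPN.
# Falling back to 1 core outside the scheduler is an assumption of this sketch.
n_cores() {
  local n="${PBS_NUM_PPN:-1}"
  # Guard against empty or non-numeric values
  [[ "$n" =~ ^[1-9][0-9]*$ ]] || n=1
  echo "$n"
}

# Demo: as if the scheduler had assigned 12 cores, and without a scheduler
demo_assigned=$(PBS_NUM_PPN=12 n_cores)
demo_default=$(unset PBS_NUM_PPN; n_cores)
echo "assigned=${demo_assigned} default=${demo_default}"
```

Reading the value in one place and validating it as a whole number helps avoid string-handling slips when the count has more than one digit.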
- TESTS: Add tests/Makefile to simplify testing of all the steps.
- TESTS: Two in-house tumor-normal data sets are now available for testing the pipeline: one complete whole-genome sample ('Patient157') and one two-chromosome subset ('Patient157t') of the same sample. Testing the full sample takes ~120+ hours (walltime) and the smaller sample takes ~20 hours to complete. Note that it is necessary to disable chastity filtering when testing with the smaller set.
- If `PYTHONPATH` was set to include a non Python 2.6.6 version, then various Python errors were produced. Now `PYTHONPATH` is unset everywhere before calling Python.
- Add "how-to-run" instructions to README.
- All PBS (*.pbs) and Bash scripts (scripts/*.sh) now pass ShellCheck tests. Significant changes involve quoting command-line options and adding assertions that `cd` and `mkdir` actually work.
- Add `make check` to check scripts with ShellCheck.
- The code is now continuously validated using the Travis CI service.
- Several Bash scripts/*.sh had `#!/bin/csh` shebangs.