Diatools is a software package for analyzing mass spectrometry data acquired in data-independent acquisition (DIA) mode. Currently, diatools consists of a single workflow that builds a spectral library from data-dependent acquisition (DDA) data and analyzes DIA samples against that library. The spectral library is built according to the protocol described in Schubert et al. 2015. The DIA data are analyzed with the OpenSWATH software, and the end result is a peptide intensity matrix. Optionally, a differential expression analysis between sample groups is performed.
Diatools runs on GNU/Linux, Windows and macOS platforms. However, the proprietary mass-spectrometer raw files have to be converted to open formats on a Windows operating system.
At least 128 GB of RAM is recommended. However, the required amount of memory depends greatly on the size of the protein sequence database.
Diatools is designed to run in a Docker container. After installing Docker, download the diatools Docker image from Docker Hub:
$ docker pull elolab/diatools:1.0
Create a folder dataset on the machine where the data analysis is done, and create the following subfolders under it:
- config
- DDA
- DIA
- ref
- out
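The folder layout above can be created in one step from a POSIX shell. This is a minimal sketch; `DATASET_DIR` is an illustrative variable name and not something diatools itself reads.

```shell
# Create the dataset folder and its five subfolders.
# DATASET_DIR is an illustrative variable, not a diatools setting.
DATASET_DIR="${DATASET_DIR:-dataset}"
for sub in config DDA DIA ref out; do
  mkdir -p "$DATASET_DIR/$sub"
done
ls "$DATASET_DIR"
```

Because `mkdir -p` does not fail on existing folders, the snippet can be rerun safely.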
This step needs to be done on a Windows platform using ProteoWizard software.
Note: Make sure that filenames do not contain spaces. Replace every space with an underscore character.
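Renaming can be scripted when there are many files. The sketch below assumes a POSIX shell is available (for example Git Bash or WSL on the Windows machine, or a Linux or macOS machine before the files are transferred); `underscore_names` is a hypothetical helper and not part of diatools or ProteoWizard.

```shell
# Replace spaces with underscores in the names of the files in one directory.
# underscore_names is a hypothetical helper, not part of diatools or ProteoWizard.
underscore_names() {
  dir="$1"
  for path in "$dir"/*; do
    [ -e "$path" ] || continue          # skip the unexpanded pattern in an empty directory
    name="${path##*/}"                  # file name without the directory part
    case "$name" in
      *" "*)
        new_name=$(printf '%s' "$name" | tr ' ' '_')
        if [ -e "$dir/$new_name" ]; then
          echo "skipped (target exists): $name" >&2
        else
          mv -- "$path" "$dir/$new_name"
        fi
        ;;
    esac
  done
}

# Demo on a scratch directory so that no real data is touched.
demo_dir=$(mktemp -d)
touch "$demo_dir/sample 1 rep 2.raw" "$demo_dir/sample_2.raw"
underscore_names "$demo_dir"
ls "$demo_dir"
```

The helper never overwrites an existing file: if the underscore version of a name is already present, the file is left alone and reported on standard error.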
Convert the DDA raw files to mzXML format and do peak picking:
FOR %i IN (*.raw) DO "\Program Files\ProteoWizard\ProteoWizard 3.0.11252\qtofpeakpicker.exe" ^
  --resolution=2000 ^
  --area=1 ^
  --threshold=1 ^
  --smoothwidth=1.1 ^
  --in %i ^
  --out %~ni.mzXML
Convert the DIA raw files to mzML format by using the MSConvert program from the ProteoWizard software. Use the following options:
- Output format: mzML
- Extension: empty
- Binary encoding precision: 64bit
- Write index: checked
- TPP compatibility: checked
- Use zlib compression: unchecked
- Package in gzip: unchecked
- Use numpress linear compression: unchecked
- Use numpress short logged float compression: unchecked
- Use numpress short positive integer compression: unchecked
- Filters: titleMaker only
Create a sequence database FASTA file that contains the following protein sequences:
- Proteins of interest (for example Swiss-Prot Human)
- iRT peptides (Biognosys|iRT-Kit_WR_fusion)
- Peptides related to lysis (UniProt ID: Q7M135)
- Digestion enzyme (typically trypsin, UniProt ID: P00761)
- Possible contaminants
Copy the FASTA file to the dataset/ref folder with the name "sequences.fasta".
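Before starting a long run, it can be worth checking that the combined FASTA file really contains the entries listed above. The following is a sketch only: `check_fasta` is a hypothetical helper, and it assumes that the identifiers (such as Q7M135 or P00761) appear literally in the FASTA header lines, which depends on where the sequences were downloaded from.

```shell
# Report how many entries a FASTA file has and whether selected identifiers
# occur in its header lines. check_fasta is a hypothetical helper.
check_fasta() {
  fasta="$1"
  shift
  echo "entries: $(grep -c '^>' "$fasta")"
  for id in "$@"; do
    if grep '^>' "$fasta" | grep -F -q -- "$id"; then
      echo "found: $id"
    else
      echo "missing: $id"
    fi
  done
}

# Demo with a made-up three-entry file; the sequences are placeholders.
demo_fasta=$(mktemp)
printf '>sp|P00761|TRYP_PIG Trypsin\nAAAK\n' > "$demo_fasta"
printf '>Biognosys|iRT-Kit_WR_fusion\nCCCK\n' >> "$demo_fasta"
printf '>example_protein_1 placeholder\nDDDK\n' >> "$demo_fasta"
check_fasta "$demo_fasta" Q7M135 P00761 iRT-Kit
```

On a real dataset the call would be `check_fasta dataset/ref/sequences.fasta Q7M135 P00761 iRT-Kit`; a "missing" line points to an entry that still needs to be appended to the file.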
The default parameters are for a nanoflow HPLC system (Easy-nLC1200, Thermo Fisher Scientific) coupled to a Q Exactive HF mass spectrometer (Thermo Fisher Scientific) equipped with a nano-electrospray ionization source.
Below is the summary of the settings:
- Precursor mass tolerance: 10 ppm
- Fragment ion tolerance: 0.02 Da
- Cleavage site: Trypsin_P
- Fixed modification: Carbamidomethyl (C)
- Variable modification: Oxidation (M)
The search parameters can be customized by modifying comet.params.template and xtandem_settings.xml and copying the files to dataset/config. The modified files are passed to the pipeline with the following extra parameters:
--comet-cfg-template config/comet.params.template
--xtandem-cfg-template config/xtandem_settings.xml
Create a spectral library from the DDA files and analyze the DIA files against the library:
$ docker run --rm \
-v /LOCALPATH/dataset/:/dataset \
--workdir /dataset/out \
-u $(id -u):$(id -g) \
compbiomed/diatools \
/opt/diatools/dia-pipeline.py \
--in-DDA-mzXML ../DDA/*.mzXML \
--in-DIA-mzML ../DIA/*.mzML \
--db ../ref/sequences.fasta \
--use-comet \
--use-xtandem
Note: the command above may need slight modification if the operating system is not Linux. For example, in the Windows command prompt the dataset path should be given as -v //c/LOCALPATH/dataset:/dataset, where c is the drive letter. Furthermore, in the Windows command prompt the -u $(id -u):$(id -g) parameter must be omitted.
After the pipeline has completed, the "out" folder contains the files dia-peptide-matrix.tsv and dia-protein-matrix.tsv. Both files are tab-separated (TSV) and can be loaded into spreadsheet programs such as MS Excel or into statistical analysis environments such as R. The file dia-peptide-matrix.tsv contains the detected peptides and their intensities in each sample. The first column contains the peptide sequence and a list of possible source proteins. Each remaining column corresponds to one sample and contains the peptide intensity in that sample. Correspondingly, dia-protein-matrix.tsv contains intensity values at the protein level.
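A quick way to confirm that a result matrix has the expected shape is to count its columns and rows. The sketch below assumes that the first line of the file is a header row; `summarize_matrix` is a hypothetical helper, and the demo file with its column name "Peptide" is made up for illustration.

```shell
# Count sample columns and data rows in a tab-separated intensity matrix.
# Assumes the first line is a header and the first column is the identifier.
summarize_matrix() {
  awk -F '\t' '
    NR == 1 { samples = NF - 1 }
    END     { printf "samples: %d\nrows: %d\n", samples, (NR > 0 ? NR - 1 : 0) }
  ' "$1"
}

# Demo with a made-up matrix: one header line and two peptide rows.
demo_matrix=$(mktemp)
printf 'Peptide\tsample1\tsample2\n' > "$demo_matrix"
printf 'PEPTIDEA\t1200.5\t980.1\n' >> "$demo_matrix"
printf 'PEPTIDEB\t15.0\t22.4\n' >> "$demo_matrix"
summarize_matrix "$demo_matrix"
```

For real output the call would be `summarize_matrix dia-peptide-matrix.tsv` from inside the "out" folder; the number of samples should match the number of DIA files given to the pipeline.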
To perform the optional differential expression analysis between sample groups, the groups must be provided using an additional parameter in the command:
--design-file <designFilename>
The design file must be a tab-separated file (see example-design-file.tsv), where the column Filename refers to the SWATH-MS filename of a sample, the column Condition is the group to which the sample belongs, the column BioReplicate refers to the biological replicate, and the column Run refers to the MS run.
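As an illustration of the four-column layout, the sketch below writes a small design file and checks that every line has exactly four tab-separated fields. All values are placeholders: the sample names, the condition labels, and the numbering are assumptions, and whether Filename should carry the file extension is not stated here, so compare the result with example-design-file.tsv before using it.

```shell
# Write a four-column, tab-separated design file and verify its shape.
# File names, conditions, and replicate numbers are placeholders.
design_file=$(mktemp)
{
  printf 'Filename\tCondition\tBioReplicate\tRun\n'
  # printf repeats the format for each group of four arguments.
  printf '%s\t%s\t%s\t%s\n' \
    sample1.mzML control   1 1 \
    sample2.mzML control   2 2 \
    sample3.mzML treatment 3 3 \
    sample4.mzML treatment 4 4
} > "$design_file"

# Count lines whose number of tab-separated fields is not four.
bad_rows=$(awk -F '\t' 'NF != 4' "$design_file" | wc -l | tr -d ' ')
echo "rows with a wrong column count: $bad_rows"
cat "$design_file"
```

Writing the file with `printf` rather than a text editor avoids the common problem of tabs being silently replaced by spaces.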
The false discovery rate (FDR) of the library building can be adjusted with the --library-FDR parameter. The TRIC feature alignment FDR can be adjusted with the --feature-alignment-FDR parameter. A comprehensive list of pipeline parameters can be obtained with the following command:
$ docker run --rm compbiomed/diatools /opt/diatools/dia-pipeline.py --help
The build process might take a while. Once the image build is completed, create a container from the image by running the command below:
$ docker run \
-d \
-p 2222:22 \
--cap-add SYS_ADMIN \
--device /dev/fuse \
--security-opt apparmor:unconfined \
openms \
/usr/sbin/sshd -D
(Optional) Copy your SSH key to the running container. If this step is done, the password is not asked at login. The password requested during this step is "Ymko7WFcLfe4U".
$ ssh-copy-id -p 2222 root@localhost
The environment is now up and running, with an SSH server listening on port 2222. Log into the environment by using a persistent screen session:
ssh -t -p 2222 root@localhost screen -R -d
If the SSH key was not copied, you need to give the root password "Ymko7WFcLfe4U" at login. Now you should be in the environment. Mount the path that contains your data from another machine over SSH. You can replace the IP address 172.17.0.1 with the address of the machine that contains the data.
# sshfs -o allow_other my_user_name@172.17.0.1:/my/path/to/datafiles /mnt
Now the environment should be ready and your data can be found under /mnt.
$ docker build -t diatools . -f Dockerfile
Convert the Docker image to a Singularity image. The command below reads the Docker image from the archive file diatools-docker.img:
$ singularity build diatools-1.0.sif docker-archive://diatools-docker.img
Run diatools:
$ singularity exec \
--bind /LOCALPATH/data/:/metaproteomics diatools-1.0.sif \
/opt/diatools/dia-pipeline.py \
--in-DDA-mzXML /metaproteomics/lib1.mzXML /metaproteomics/lib2.mzXML \
--in-DIA-mzML /metaproteomics/sample1.mzML /metaproteomics/sample2.mzML \
--use-comet \
--use-xtandem \
--db sequences.fasta