After successfully installing all the dependencies, the following command can be directly run to test PPIprophet with the example dataset (i.e. test/test_fract.txt):
python3 main.py -sid test/test_ids.txt
The default input and output folders are '/test/' and '/Output/' respectively, under the PPIprophet working folder. A run generally takes ~1 hr per file, but the computation time increases exponentially with the number of protein IDs in the file. We suggest using a high-performance computing environment when submitting a whole-proteome search rather than an affinity-purified sample.
In the PPIprophet package, all parameters can be configured either via the ‘ProphetConfig.conf’ file or via command-line arguments. When PPIprophet is run, the parameters given on the command line are written into the ‘ProphetConfig.conf’ file. Generally, four types of features are needed:
The input file needs to be a wide-format matrix with two essential columns:
GN : Gene name or protein id, needs to be the first column. This is a unique identifier and having duplicate rows will trigger a DuplicateError from PPIprophet.
The remaining columns need to be ordered according to the fractionation scheme used. There is no strict requirement for column names apart from GN and ID, but the columns need to be in fraction order. All quantitation schemes commonly used in proteomics, such as MS1 or MS2 extracted ion chromatograms (XIC), spectral counts (SPCs), and TMT or SILAC ratios, are supported.
An example of correct formatting is provided in test/test_fract.txt.
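The formatting rules above can be checked before a run with a small script. This is only a sketch, not part of PPIprophet: it assumes the input file is tab-delimited, that GN (and ID, if present) are the only non-numeric columns, and the function name `check_fractionation_table` is hypothetical.

```python
import csv
import io
import math


def check_fractionation_table(text, delimiter="\t", id_names=("GN", "ID")):
    """Return a list of problems found in a wide-format fractionation table."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    if not rows or not rows[0]:
        return ["file is empty or has no header"]
    header, body = rows[0], rows[1:]
    problems = []
    if header[0] != "GN":
        problems.append("first column must be named GN")
    seen = set()
    for line_no, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} fields, got {len(row)}")
            continue
        # GN must be unique across rows.
        if row[0] in seen:
            problems.append(f"line {line_no}: duplicate GN {row[0]!r}")
        seen.add(row[0])
        # Every non-identifier column must hold a real number (no NA, no blanks).
        for name, value in zip(header, row):
            if name in id_names:
                continue
            try:
                number = float(value)
            except ValueError:
                number = math.nan
            if math.isnan(number):
                problems.append(f"line {line_no}: non-numeric value {value!r} in column {name!r}")
    return problems
```

An empty list means none of these checks failed. For a file on disk, the text can be read with `open(path).read()` and passed in.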
-output The output folder
-sid Sample identifier file
-all The number of fractions to use [1, X].
-is_ppi Whether the provided database is a PPI network or a complex database
-ma Choose ‘all’ to use data-driven plus database-based hypothesis generation, or ‘reference’ to use only database-derived complexes
-fdr False discovery rate for hypothesis, 0 < FDR < 1
All parameters can be inspected using
python3 main.py --help
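For scripted runs, the command line can be assembled and checked in Python before launching. The sketch below uses only flags listed above; the helper name `build_ppiprophet_command` is hypothetical, the FDR value is purely illustrative, and it is an assumption that each flag takes a single value after it (confirm with `--help`).

```python
def build_ppiprophet_command(sid, output=None, fdr=None, ma=None):
    """Build an argument list for main.py from a few of the documented flags."""
    if fdr is not None and not 0 < fdr < 1:
        raise ValueError("fdr must satisfy 0 < FDR < 1")
    if ma is not None and ma not in ("all", "reference"):
        raise ValueError("ma must be 'all' or 'reference'")
    cmd = ["python3", "main.py", "-sid", sid]
    if output is not None:
        cmd += ["-output", output]
    if fdr is not None:
        cmd += ["-fdr", str(fdr)]
    if ma is not None:
        cmd += ["-ma", ma]
    return cmd


# Illustrative values only; 0.1 is not a recommended FDR.
cmd = build_ppiprophet_command("test/test_ids.txt", output="Output", fdr=0.1, ma="all")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the PPIprophet working folder.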
The file ‘sample_ids.txt’ stores the experimental information and needs to contain the following headers:
Sample | cond | group | short_id | repl | fr |
---|---|---|---|---|---|
- Sample full path of the file intended to be processed
- cond condition name
- group group number (integer, needs to be 1 for control)
- short_id alternative id
- repl replicate number within the condition
- fr number of fractions per file
Note: In the ‘Sample’ column, please make sure that the content is identical with the testing file name (with the file extension). In the ‘cond’ column, if you have multiple conditions, please label them exactly as ‘Ctrl’, ‘Treat1’, and ‘Treat2’ etc. Failure to do so will cause problems when running PPIprophet.
Here is an example of a complete table with two conditions and three replicates:
Sample | cond | group | short_id | repl | fr |
---|---|---|---|---|---|
./Input/c1r1.txt | Ctrl | 1 | ipsc_2i_1 | 1 | 65 |
./Input/c1r2.txt | Ctrl | 1 | ipsc_2i_2 | 2 | 64 |
./Input/c1r3.txt | Ctrl | 1 | ipsc_2i_3 | 3 | 65 |
./Input/c2r1.txt | Treat1 | 2 | ipsc_ra_1 | 1 | 65 |
./Input/c2r2.txt | Treat1 | 2 | ipsc_ra_2 | 2 | 65 |
./Input/c2r3.txt | Treat1 | 2 | ipsc_ra_3 | 3 | 65 |
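A table like this can also be generated rather than typed by hand. The following is a minimal sketch under assumptions: the delimiter expected by PPIprophet is not stated here, so tab is assumed, and the helper names `make_sample_rows` and `render_sample_ids` are hypothetical.

```python
import csv
import io

FIELDS = ["Sample", "cond", "group", "short_id", "repl", "fr"]


def make_sample_rows(conditions, n_fractions):
    """conditions: list of (cond_label, short_prefix, [file paths]), control first."""
    rows = []
    for group, (cond, prefix, paths) in enumerate(conditions, start=1):
        for repl, path in enumerate(paths, start=1):
            rows.append({
                "Sample": path,
                "cond": cond,
                "group": group,  # the control condition comes first, so it gets group 1
                "short_id": f"{prefix}_{repl}",
                "repl": repl,
                "fr": n_fractions,
            })
    return rows


def render_sample_ids(rows, delimiter="\t"):
    """Render the rows as delimited text (tab-delimited is an assumption)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter=delimiter, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


conditions = [
    ("Ctrl", "ipsc_2i", ["./Input/c1r1.txt", "./Input/c1r2.txt", "./Input/c1r3.txt"]),
    ("Treat1", "ipsc_ra", ["./Input/c2r1.txt", "./Input/c2r2.txt", "./Input/c2r3.txt"]),
]
rows = make_sample_rows(conditions, n_fractions=65)
```

If a file has a different number of fractions (as c1r2.txt does in the table above), set `fr` on that row before rendering, e.g. `rows[1]["fr"] = 64`, then write the rendered text to the sample identifier file.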
PPIprophet can be run using all default settings with
python3 main.py
PPIprophet generates two folders: the ‘tmp’ folder and the user-designated ‘Output’ folder. The ‘tmp’ folder stores all the intermediate files that PPIprophet processes and can therefore be used for debugging and validation. The ‘tmp’ folder can be safely deleted after PPIprophet finishes all the prediction and analysis.
The ‘Output’ folder, on the other hand, contains all the output files generated by PPIprophet.
In the output folder the following text files are present:
- adj_list.txt: PPI list in the form proteinA/proteinB/Probability (from DNN prediction) and Crapome frequencies for both proteins.
- communities.txt: Modules detected after MCL clustering of the modified WD scores.
- d_scores.txt: Interaction probabilities converted to modified WD scores (see paper for mathematical description and derivation).
- probtot.txt: Adjacency list in which interactions scoring below the threshold (at the user-specified FDR) are zeroed. Users can find the full scored matrix (prior to FDR filtering) in tmp/dnn.txt
- prot_centr.txt: Protein-centric output where for every protein entry all the identified interactors are concatenated.
Both probtot.txt and d_scores.txt are fully compatible with Cytoscape import without the need for any additional formatting. For additional mapping, adj_list.txt can be used to visualize the frequency of interactions in crapome.
Depending on the error raised, different fixes are needed.
- NaRowError / NaInMatrixError: There are 'NA' values in the input matrix; substitute them with 0
- MissingColumnError: Identifier columns (GN or ID) are missing
- DuplicateRowError / DuplicateIdentifierError: There are duplicates in the GN column. A common cause of this is the mapping of several isoforms to the same gene name. Add _1 to one of the duplicate gene names
- EmptyColumnError: A column contains only NA
Note: if files have different numbers of fractions, pad the shorter ones by adding columns filled entirely with 0
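The fixes listed above can be applied in one pass before running PPIprophet. This is a sketch under assumptions, not PPIprophet code: the helper name `clean_matrix` and the `pad_N` column names are hypothetical, the table is held as lists of strings, and the identifier columns are assumed to come first in the header.

```python
def clean_matrix(header, rows, n_fractions=None, id_names=("GN", "ID"), missing=("NA", "")):
    """Replace missing values with 0, make GN unique, and pad with zero columns."""
    # Count the leading identifier columns (GN, optionally followed by ID).
    n_ids = 0
    for name in header:
        if name not in id_names:
            break
        n_ids += 1
    if n_ids == 0:
        raise ValueError("header must start with an identifier column such as GN")

    seen = set()
    new_rows = []
    for row in rows:
        ids, values = list(row[:n_ids]), list(row[n_ids:])
        # NaRowError / NaInMatrixError: substitute missing values with 0.
        values = ["0" if value.strip() in missing else value for value in values]
        # Duplicate GN: append _1, _2, ... until the name is unique.
        candidate, suffix = ids[0], 0
        while candidate in seen:
            suffix += 1
            candidate = f"{ids[0]}_{suffix}"
        seen.add(candidate)
        ids[0] = candidate
        new_rows.append(ids + values)

    new_header = list(header)
    if n_fractions is not None:
        # Different fraction counts: add all-zero columns up to n_fractions.
        extra = n_fractions - (len(header) - n_ids)
        if extra > 0:
            new_header += [f"pad_{i}" for i in range(1, extra + 1)]
            new_rows = [row + ["0"] * extra for row in new_rows]
    return new_header, new_rows
```

For example, `clean_matrix(["GN", "f1"], [["A", "NA"], ["A", "2"]], n_fractions=2)` yields the header `["GN", "f1", "pad_1"]` and the rows `["A", "0", "0"]` and `["A_1", "2", "0"]`. Renaming isoforms this way keeps every row; whether the renamed entries should be merged instead is a decision for the user.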
Please refer to README.md for how to contact us.