
Running SymPortal

didillysquat edited this page Jul 30, 2019 · 48 revisions

This section will guide you through the two main steps of running a SymPortal analysis: loading data into the database and running an analysis.

It will also cover some lower-level functionality, including stand-alone data output and the generation of within clade, pairwise distances and PCoA ordinations.

This guide assumes you already have SymPortal set up and uses the same example dataset that is used in the SymPortal manuscript. This dataset can be downloaded from here.

If you have set up SymPortal correctly, you should be able to run the tests module successfully. Ensure that you are within the SymPortal_framework root directory and that you have your virtual environment activated (if you are using one).

$ python3 -m tests.tests

This will perform two loadings and one data analysis.

The end of the output should look similar to:

Plotting between its2 type profile distances
Distance plots output to:
/Users/humebc/Documents/testing_2/SymPortal_framework/outputs/analyses/1/between_its2_type_profile_distances/C/2019-01-29_06-38-26.198583_between_its2_type_prof_dist_clade_C.svg
/Users/humebc/Documents/testing_2/SymPortal_framework/outputs/analyses/1/between_its2_type_profile_distances/C/2019-01-29_06-38-26.198583_between_its2_type_prof_dist_clade_C.png
data_analysis ID is: 1
Analysis testing SUCCESSFUL
Deleting /Users/humebc/Documents/testing_2/SymPortal_framework/outputs/analyses/1
Output directory deleted
test data_analysis 1 deleted
testing complete

i. Download data

Download the dataset from here. Note the directory in which the dataset is saved. This will be used as an input in the following steps.


Loading data

The first step of analysing any dataset is to load it into the SymPortal framework's database. In this step, SymPortal will perform all quality control filtering of the sequence data and convert the raw sequence data into database objects such as DataSet, DataSetSample, CladeCollection, DataSetSampleSequence and ReferenceSequence. For more information on the database objects and how to query the database, please see the Querying the SymPortal database page. As part of a DataSet loading, SymPortal will output a count table of the ITS2 sequences returned from each of the DataSetSamples in the DataSet. A plot visualizing this count table will also be produced. In addition, SymPortal will generate clade separated, between sample pairwise distance matrices (BrayCurtis-based by default). By default SymPortal will also run a principal coordinate analysis (PCoA) on these distance matrices and return the coordinates of the principal components for each DataSetSample.

To load a dataset to the database:

$ ./main.py --load /path/to/example_data_location --name first_loading

By default, the loading will be completed using a single process; however, multiple processors may be utilised using the --num_proc argument.

$ ./main.py --load /path/to/example_data_location --name first_loading --num_proc 3

The --data_sheet flag may also be applied to loadings (recommended). By passing this flag along with the full path to a data sheet, meta information may be associated with the DataSetSample objects held in the SymPortal database. Currently, the meta information that can be associated with a DataSetSample is limited to the pre-labelled column headers provided in the blank data sheet. However, users may populate the data sheet with additional information in the columns to the right of the pre-labelled headers with no detriment to the loading. For the purposes of the Smith et al. example submission used here, a pre-populated data sheet has been provided in the google drive folder containing the data. When running your own submission, use the blank data sheet.

To run the data loading with the provided data sheet in this example:

$ ./main.py --load /path/to/example_data_location --name first_loading --num_proc 3 --data_sheet /path/to/example_data_location/smith_et_al_meta_input.xlsx

Passing a data sheet at loading also allows custom sample names to be associated with each of the fastq.gz pairs. If no data sheet is provided, sample names will be generated automatically from the names of the respective fastq.gz pairs.

To switch off the automatic generation of plots (stacked bar charts for the count tables and scatter plots of the clade separated PCoA coordinates), the --no_figures flag can be passed as an argument.

$ ./main.py --load /path/to/example_data_location --name first_loading --num_proc 3 --data_sheet /path/to/example_data_location/smith_et_al_meta_input.xlsx --no_figures

To switch off the ordination component of the data loading, the --no_ordinations flag may be passed.

$ ./main.py --load /path/to/example_data_location --name first_loading --num_proc 3 --data_sheet /path/to/example_data_location/smith_et_al_meta_input.xlsx --no_ordinations

Both the --no_figures and --no_ordinations flags may be passed simultaneously.
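For example, to run the example loading with both plotting and ordination switched off (combining the two flags documented above):

```shell
$ ./main.py --load /path/to/example_data_location --name first_loading --num_proc 3 --no_figures --no_ordinations
```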

A note on running SymPortal with multiple processors

SymPortal has been designed to take advantage of multi-processor environments and is parallelised wherever possible. As of version 0.3.0 users should have no problems with instability when running across many cores whilst using the SQLite database.

However, in versions prior to 0.3.0, users should exercise caution when loading and running analyses across multiple processors: the relatively simple SQLite database that the local framework comes set up with by default (for the sake of simplicity) has limited support for handling simultaneous write requests. As such, whilst all of the SymPortal functions may be run with multiple processors, when using the default SQLite database the chance of a 'timeout' failure or conflict increases with the number of processors used. In general, a small degree of parallelisation, e.g. --num_proc 3, should be very unlikely to cause any issues. To robustly run SymPortal in a highly parallelised manner, the SQLite database should be upgraded to a server-based PostgreSQL database.

There are multiple online resources that can be used to aid in setting up the Django framework (the framework underlying SymPortal's interaction with the database) to run with a PostgreSQL, rather than a SQLite, database: here, here and here for example.
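As a minimal sketch of what such a switch involves, the DATABASES entry in the Django settings module would change from the SQLite default to the PostgreSQL backend. All of the connection values below (database name, user, password, host, port) are placeholders you would replace with your own:

```python
# Django settings module used by the framework's database layer.
# Every value below is a placeholder -- substitute your own details.
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',  # instead of the SQLite backend
        'NAME': 'symportal_database',               # placeholder database name
        'USER': 'symportal_user',                   # placeholder role
        'PASSWORD': 'your_password',                # placeholder password
        'HOST': 'localhost',
        'PORT': '5432',
    }
}
```

The PostgreSQL database and role themselves must be created separately (e.g. via createdb/createuser) before Django can connect to them.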

Checking DataSet loadings

The ID, name and time stamp of the loaded DataSet instances can be output by running the following command:

$ ./main.py --display_data_sets
1: first_loading	2018-07-04 05:07:59.418975

Debug mode

The --debug flag can be passed when loading data. When this flag is passed the output to stdout will be significantly more verbose. This is useful in identifying where errors may be occurring.
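For example, to run the example loading with verbose output:

```shell
$ ./main.py --load /path/to/example_data_location --name first_loading --debug
```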


Running an analysis

Running a SymPortal analysis will programmatically search for recurring sets of ITS2 sequences in the DataSetSamples of the DataSet objects submitted to the analysis. The output of an analysis is a count table of predicted ITS2 type profiles (representative of putative taxa). By default, a graphical representation of this count table, as well as of the sequences found in the samples, is output. Clade separated, between ITS2 type profile pairwise similarity distance matrices are also output (BrayCurtis-based by default), as are the coordinates of the PCoA run on each of these matrices.

To run an analysis on one of the DataSet instances that have been loaded into the database:

$ ./main.py --analyse 1 --name first_analysis --num_proc 3

Where 1 is the ID of the DataSet to analyse.

Note that when running an analysis containing only one DataSet instance, an int can be passed to the --analyse flag. However, if you wish to run an analysis containing more than one DataSet instance, a comma separated string may be passed.

e.g.

$ ./main.py --analyse 1,2,5 --name second_analysis --num_proc 3
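If you are scripting SymPortal runs, for example from Python, the comma-separated UID string can be built from a list before calling main.py. This is an illustrative sketch, not part of SymPortal itself; the subprocess call is left commented out so the snippet only constructs the command:

```python
import subprocess  # only needed if you uncomment the call below

# UIDs of the DataSet objects to include in the analysis
dataset_uids = [1, 2, 5]

# Build the comma-separated string expected by the --analyse flag
analyse_arg = ",".join(str(uid) for uid in dataset_uids)

# Equivalent to: ./main.py --analyse 1,2,5 --name second_analysis --num_proc 3
cmd = ["./main.py", "--analyse", analyse_arg,
       "--name", "second_analysis", "--num_proc", "3"]
# subprocess.run(cmd, check=True)  # uncomment to run from the SymPortal_framework directory
```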

Similar to running a DataSet loading, the --no_figures and --no_ordinations flags may be passed (no figure output; no ordination or similarity calculations).

N.B. only the DataSets that have been listed as part of the --analyse command (e.g. the DataSets with UIDs 1, 2, and 5 above) will be used to generate ITS2 type profiles. For example, if another DataSet (e.g. UID 6) had been loaded into the database, it would have absolutely no influence on the analysis results for the above command.

Checking DataAnalysis instances

The ID, name and time stamp of completed DataAnalysis instances can be output by running the following command:

$ ./main.py --display_analyses
1: first_analysis	2018-07-04 05:57:23.933207

Data output - independent of loading or analysis

Count tables for ITS2 sequence and ITS2 type profile abundances are output automatically during data loading and analyses, respectively. However, these count tables may also be output independent of a loading or analysis. Corresponding plots will also be output.

To output only the ITS2 sequence count table (i.e. equivalent to the count table output after a DataSet loading):

The output files will be displayed on completion of table generation and output:

$ ./main.py --print_output_seqs 1
DIV table output files:
./outputs/non_analysis/1.DIVs.absolute.txt
./outputs/non_analysis/1.DIVs.relative.txt
./outputs/non_analysis/1.DIVs.fasta

To output both ITS2 sequence and ITS2 type profile count tables (i.e. equivalent to the count tables output during an analysis):

$ ./main.py --print_output_types 1 --data_analysis_id 1 --num_proc 3
DIV table output files:
/SymPortal_framework/outputs/1/1_1.DIVs.absolute.txt
/SymPortal_framework/outputs/1/1_1.DIVs.relative.txt
/SymPortal_framework/outputs/1/1_1.DIVs.fasta
ITS2 type profile output files:
/SymPortal_framework/outputs/1/1_1.profiles.absolute.txt
/SymPortal_framework/outputs/1/1_1.profiles.relative.txt

For either output type, multiple DataSet IDs may be passed:

$ ./main.py --print_output_types 1,3,5 --data_analysis_id 1 --num_proc 3

This may be useful when only wanting to output a subset of the DataSet objects that were input to an analysis.

In addition to outputting count tables for given DataSet and DataAnalysis objects it is also possible to produce the above outputs for a given set of DataSetSample objects (samples). These samples can be from multiple DataSet objects.

To output the ITS2 sequence count tables for a collection of samples, irrespective of a fixed collection of DataSet objects, a comma-separated list of DataSetSample UIDs can be passed to the --print_output_seqs_sample_set flag.

$ ./main.py --print_output_seqs_sample_set 1221,1222,1223,1224,6748,4789 --num_proc 3

The equivalent command to output the ITS2 type profile count tables, as well as the ITS2 sequence count tables, is:

$ ./main.py --print_output_types_sample_set 1221,1222,1223,1224,6748,4789 --data_analysis_id 8 --num_proc 3

Generating within clade, pairwise distances and PCoA

Pairwise distances and principal coordinate analyses (PCoA) may be generated either between samples or between ITS2 type profiles. Distance matrices may be generated by either a BrayCurtis- or a UniFrac-based methodology. The BrayCurtis method is faster and, because it requires no third-party programs, is run by default. However, BrayCurtis can perform less optimally than the UniFrac method when comparing very closely related ITS2 sequence profiles. The UniFrac methodology requires more time to compute and additional dependencies; please see this section of the wiki regarding additional dependencies. To use a UniFrac-based methodology rather than the default BrayCurtis methodology, pass --distance_method unifrac.
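For example, to compute between sample distances for the DataSet with UID 5 using the UniFrac methodology rather than the default BrayCurtis (combining the --between_sample_distances and --distance_method flags documented on this page):

```shell
$ ./main.py --between_sample_distances 5 --distance_method unifrac --num_proc 3
```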

N.B. when calculating between ITS2 type profile distances, the --data_analysis_id flag must always be provided with the UID of the DataAnalysis from which the ITS2 type profiles were predicted (see examples below).

Distances may be generated using three different input methods:

1. Single or multiple DataSet UIDs may be provided as the argument to either --between_sample_distances or --between_type_distances.

e.g.:

To calculate between sample distances for all samples in the DataSet with UID 5: $ ./main.py --between_sample_distances 5 --num_proc 3

or

To calculate between ITS2 type profile distances for all ITS2 type profiles predicted from the DataAnalysis with UID 2, only outputting for samples that were in the DataSet with UID 5: $ ./main.py --between_type_distances 5 --data_analysis_id 2 (note that the DataSets to be output must have been run in the DataAnalysis).

2. Multiple DataSetSample UIDs may be provided as the argument to either --between_sample_distances_sample_set or --between_type_distances_sample_set.

e.g.:

To calculate between sample distances from DataSetSample objects with the UIDs 6, 7, 8 and 23: $ ./main.py --between_sample_distances_sample_set 6,7,8,23

This option allows for the calculation of distances between samples that may be a subset of one or multiple DataSets.

or

To calculate between ITS2 type profile distances for all ITS2 type profiles predicted for the DataSetSample objects with UIDs 6, 7, 8 and 23, as predicted from the DataAnalysis with the UID 2: $ ./main.py --between_type_distances_sample_set 6,7,8,23 --data_analysis_id 2 (note that the DataSetSamples to be output must have been run in the DataAnalysis).

3. For between ITS2 type profile distances only, UIDs of CladeCollectionType objects may be provided to the --between_type_distances_cct_set argument.

A CladeCollectionType object represents the junction between a CladeCollection object (a grouping of sequences from the same clade all from a single sample) and an AnalysisType object (proxy for an ITS2 type profile) that has been predicted from the sequences in question. Calculating distances between specific CladeCollectionType objects therefore represents calculating distances between specific instances of ITS2 type profiles that were predicted. E.g. $ ./main.py --between_type_distances_cct_set 34,35,36,37 --data_analysis_id 2.
