An automated, fast and accurate tool that utilises cloud computing and machine learning to perform data mining in the GEO database
Python 3 is required. Install the required Python packages.
pip install -r requirements.txt
PySpark is also required; follow the instructions here to install it.
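If you prefer, PySpark can typically also be installed from PyPI; this assumes a pip-based setup, and the version installed should match your Spark installation.
pip install pyspark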
Install the required R packages.
Rscript install_packages.R
Download the required databases and tools.
bash dl_db_tools.sh <output_dir>
The pipeline consists of two steps: classification and analysis. The classification step identifies perturbation experiments, groups replicate samples, and matches control and perturbation samples. The analysis step calculates differential expression using the classification results from the previous step.
The classification step requires either a file containing a list of GSE IDs, supplied with the -i option, or keywords to search GEO, supplied with the -ak, -ok and/or -o options.
Below is an example of the format of the file containing a list of GSE IDs.
GSE14491
GSE16416
GSE17708
GSE23952
GSE28448
GSE42373
Example command for running the classification step.
spark-submit source/classify_gse/classify_gse.py \
-i gse_ids.txt \
-d GEOmetadb.sqlite \
-t classifiers/gse_clf_title.pickle \
-s classifiers/gse_clf_summary.pickle \
-m classifiers/gsm_clf.pickle \
-n NobleCoder-1.0.jar
Detailed script usage can be accessed using the -h option.
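As an alternative to the -i option, the classification step can also be driven by keyword searches against GEO using the -ak, -ok and/or -o options. The exact keyword syntax is not documented here, so the command below is only an illustrative sketch in which "knockout" is a placeholder search term; the remaining arguments mirror the file-based example above.
spark-submit source/classify_gse/classify_gse.py \
-ak "knockout" \
-d GEOmetadb.sqlite \
-t classifiers/gse_clf_title.pickle \
-s classifiers/gse_clf_summary.pickle \
-m classifiers/gsm_clf.pickle \
-n NobleCoder-1.0.jar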
The analysis step requires the output file from the previous step.
Example command for running the analysis step.
spark-submit source/analyse_gse/analyse_gse.py \
-i microarray_classified_results.csv \
-d GEOmetadb.sqlite \
-ms source/analyse_gse/analyse_microarray.R \
-rs source/analyse_gse/analyse_rna_seq.R \
-mm mouse_matrix.h5 \
-hm human_matrix.h5
Detailed script usage can be accessed using the -h option.
The AWS Command Line Interface (AWS CLI) is a utility with which the user can interact with the AWS "universe": list objects in an S3 bucket, check on a running EMR cluster, and perform many other functions.
Instructions on how to install the AWS CLI can be found at: http://docs.aws.amazon.com/cli/latest/userguide/installing.html
In a Linux environment, the command to install the AWS CLI is:
pip install awscli
Once installed, the AWS CLI needs to be configured with your AWS credentials.
Running GEOracle+ on AWS requires an AWS account and an AWS Access Key, which enables programmatic access to AWS resources. For information on how to create an AWS Access Key, refer to:
Managing Access Keys for IAM Users
Once you have your AWS Access Key ID and Secret Access Key, configure the AWS CLI by typing the following command:
aws configure
Enter the information as prompted, including the Default region name. If you are unsure about the Default output format, you may leave it blank by pressing Enter.
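A typical configuration session looks like the following; the key, secret and region shown here are placeholders, not real credentials.
aws configure
AWS Access Key ID [None]: AKIAXXXXXXXXXXXXXXXX
AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Default region name [None]: us-east-1
Default output format [None]: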
Once this configuration is complete, you can use the AWS CLI by typing something like:
aws help
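For example, to list the contents of an S3 bucket (replace the bucket name with your own):
aws s3 ls s3://your-bucket-name/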
Configure the config/emr_cluster.config file for launching a cluster, and the config/classify_job.config and config/de_analysis_job.config files for submitting the classification and analysis jobs respectively.
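The exact keys in these configuration files are project-specific and should be taken from the templates shipped in config/; the snippet below is purely an illustrative sketch, and every key name and value in it is hypothetical.
# Hypothetical keys for illustration only; consult the shipped templates
cluster_name = georacle-cluster
instance_type = m5.xlarge
instance_count = 4
s3_bucket = s3://your-bucket-name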
After that, launch a cluster:
python launch_cluster.py --config config/emr_cluster.config
Wait until the cluster has been created successfully, then submit the classification and analysis jobs respectively:
python submit_classify_job.py --config config/classify_job.config
python submit_de_analysis_job.py --config config/de_analysis_job.config
The output files will be stored on the S3 bucket specified in the configuration files.
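Once the jobs have finished, the result files can be copied from S3 with the AWS CLI; the bucket name and output prefix below are placeholders for whatever is set in your configuration files.
aws s3 sync s3://your-bucket-name/output/ ./results/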