# GenePattern Python Library Tutorial

This is a short tutorial on how to use the [GenePattern Python library](http://www.broadinstitute.org/cancer/software/genepattern/programmers-guide#_Using_GenePattern_from_Python) in your Python scripts or in the [GenePattern Notebook](http://www.broadinstitute.org/cancer/software/genepattern/genepattern-notebooks) environment. In it we will develop a simple Python program that connects to a GenePattern Server, runs a module and loads the resulting files for further analysis. The included code can be executed directly from the notebook, so that you can try it out, modify it or create your own solutions. 

This tutorial assumes that the reader is familiar with [GenePattern](http://genepattern.org) and its basic associated concepts, such as modules, jobs and result files. If you wish to learn more about GenePattern, [click here](http://www.broadinstitute.org/cancer/software/genepattern/).

### Installation

If you are using this tutorial in the GenePattern Notebook environment, the GenePattern Python library will already be installed. If you're using this notebook outside of the GenePattern Notebook environment, you may need to install the library. [Installation instructions are available here.](http://www.broadinstitute.org/cancer/software/genepattern/programmers-guide#_Using_GenePattern_from_Python)


### Compatibility

The GenePattern Python library supports both Python 2.7 and Python 3.3+.

## Import the Library

The first step is to import the GenePattern library into your script, using the code below. All methods provided by the GenePattern library can then be accessed through the *gp* namespace.

In [10]:
import gp

## Connect to a GenePattern Server

The next step in using the GenePattern Python library is to connect to an existing GenePattern server. This will require entering the URL of the server, as well as your username and password credentials. 

The code below connects to the public GenePattern server running at the Broad Institute. If you have not used this server before, you will first need to create an account. [Click here to register an account.](http://genepattern.broadinstitute.org/gp/pages/registerUser.jsf)

Replace *myusername* and *mypassword* in the code below with your actual username and password.

In [18]:
# Create a GenePattern server proxy instance
gpserver = gp.GPServer('https://cloud.genepattern.org/gp', 'myusername', 'mypassword')
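
Hard-coding a password in a script or shared notebook makes it easy to leak. Here is a minimal sketch of one alternative, written for Python 3: read the credentials from environment variables and fall back to an interactive prompt. The helper `load_credentials()` and the variable names `GENEPATTERN_USERNAME` and `GENEPATTERN_PASSWORD` are assumptions made for this example; they are not part of the GenePattern library.

```python
import os
from getpass import getpass


def load_credentials(env=None):
    """Return (username, password), preferring environment variables.

    The environment variable names are hypothetical; use whatever suits your setup.
    """
    env = os.environ if env is None else env
    username = env.get("GENEPATTERN_USERNAME")
    password = env.get("GENEPATTERN_PASSWORD")
    if not username:
        # Fall back to prompting when the variable is unset or empty
        username = input("GenePattern username: ")
    if not password:
        # getpass hides the typed password from the screen
        password = getpass("GenePattern password: ")
    return username, password


# Usage with the connection call from this tutorial (commented out because it
# needs the gp library and a reachable server):
# username, password = load_credentials()
# gpserver = gp.GPServer('https://cloud.genepattern.org/gp', username, password)
```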

### Query for Available Modules 

If you do not know which modules are available on the GenePattern server, you can list them programmatically by running the code below.

In [38]:
# Get the list of tasks
task_list = gpserver.get_task_list()

This will return a list of GPTask objects. A GPTask object is a Python object which represents a specific module. In the case of the command above, the returned list will contain GPTask objects for every module the server makes available. These GPTask objects will provide the module name, LSID, a description and the version number of the module. More on GPTask objects is described in the next section.

This list can be iterated over like any Python list to obtain information from each GPTask object. Below is a code example that will print the name of each module.

In [None]:
for task in task_list:
    print(task.get_name())
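
A public server can expose a long list of modules, so it is often handy to search the list rather than print all of it. The sketch below relies only on the `get_name()` method used above; the helper name `find_tasks()` is an assumption made for this example.

```python
def find_tasks(task_list, keyword):
    """Return the sorted names of tasks whose name contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    return sorted(
        task.get_name()
        for task in task_list
        if keyword in task.get_name().lower()
    )


# Example, assuming task_list came from gpserver.get_task_list():
# print(find_tasks(task_list, "preprocess"))
```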

## Referencing a Module

If you already know the name or LSID of the module you want, you can also obtain a GPTask object for it directly. 

Throughout this tutorial we will use the PreprocessDataset module. This module is designed to perform some basic preprocessing steps on gene expression data. For example, it can apply a ceiling, floor and fold change filter. More on this module is [available here](http://www.broadinstitute.org/cancer/software/genepattern/modules/docs/PreprocessDataset/5).

The example code below obtains a GPTask object for the PreprocessDataset module in two different ways: by name and by LSID. Either method works. Specifying the LSID is more precise, as an LSID can include a specific version number for the module, whereas specifying by name always obtains the latest version of the module. See how GenePattern implements [version numbers](https://www.genepattern.org/concepts/#_Version_Numbers) using Life Science Identifiers (LSIDs).

In [None]:
# Obtaining GPTask by module name
module = gp.GPTask(gpserver, "PreprocessDataset")

# Obtaining GPTask by LSID
module = gp.GPTask(gpserver, "urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020")
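
To check whether an LSID pins a version, it can help to split it into its parts. The sketch below assumes the common LSID layout `urn:lsid:<authority>:<namespace>:<object>[:<version>]`; the helper `split_lsid()` is hypothetical and not part of the GenePattern library.

```python
def split_lsid(lsid):
    """Split an LSID string into its components.

    Assumes the layout urn:lsid:authority:namespace:object[:version].
    The version entry is None when the LSID does not carry one.
    """
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError("Not a recognizable LSID: %r" % lsid)
    return {
        "authority": parts[2],
        "namespace": parts[3],
        "object": parts[4],
        "version": parts[5] if len(parts) == 6 else None,
    }


# The LSID used in this tutorial carries no version suffix
info = split_lsid("urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00020")
print(info["object"], info["version"])
```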

### Loading Parameter Information

At this point the GPTask object only contains some very basic data used to identify the module. Before you can use this module to run jobs, however, you will first need to load the full parameter data from the server. This can be accomplished by running the code shown below.

In [None]:
# Load the full parameter data
module.param_load()

Below is a code example showing how to reference the information that was loaded for the module.

In [None]:
# Print the module name
print( module.get_name() )

# Print the module LSID
print( module.get_lsid() )

# Print the module version
print( module.get_version() )

# Print the description
print( module.get_description() )

After being loaded, each of the module's parameters will be represented by a GPTaskParam object. The list of parameters can be obtained and explored using the example code below. Each GPTaskParam object contains the parameter's name, type, description, default value, whether it is optional and other metadata, as shown.

In [None]:
# Get the list of GPTaskParam objects
params_list = module.get_parameters()

for param in params_list:              # Loop through each parameter
    print( param.get_name() )          # Print the parameter's name
    print( param.get_type() )          # Print the parameter's type (text, number, file, etc.)
    print( param.get_description() )   # Print the parameter's description
    print( param.get_default_value() ) # Print the parameter's default value
    print( param.is_optional() )       # Print whether the parameter is optional
    print( '' )                        # Leave a blank line between printed parameters
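
When preparing a job, the parameters that matter most are those that are required but have no default, since you must supply them yourself. The sketch below finds them using only the `get_name()`, `is_optional()` and `get_default_value()` methods shown above. The helper name is hypothetical, and treating an empty string as "no default" is an assumption made for this example.

```python
def required_without_default(params_list):
    """Return the names of parameters that are required and have no default value."""
    missing = []
    for param in params_list:
        # Assumption: None or an empty string both mean "no default"
        has_default = param.get_default_value() not in (None, "")
        if not param.is_optional() and not has_default:
            missing.append(param.get_name())
    return missing


# Example, assuming params_list came from module.get_parameters():
# print(required_without_default(params_list))
```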

Some parameters come with a list of valid choices. These parameters can be identified by calling the is_choice_param() method. Additionally, choice parameters have a number of other methods available for working with the choice list. Calling these methods on a non-choice parameter will result in an error being thrown. An example is given below.

In [None]:
# Loop through each parameter
for param in params_list:
    if param.is_choice_param():        # If the parameter is a choice param
        print( param.get_name() )      # Print the parameter's name
        
        choices = param.get_choices()  # Get a list of valid choices 
        for choice in choices:         # Print the label and value for each choice
            print( choice['label'] + " = " + choice['value'] )
            
        # Print the parameter's default selected value
        print( param.get_choice_selected_value() )
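
Because each choice is a dictionary with `'label'` and `'value'` keys, as used above, the choice list converts readily into a lookup table from the human-readable label to the value that is actually submitted. This is a small sketch; the helper name `choices_by_label()` is an assumption made for this example.

```python
def choices_by_label(param):
    """Map each choice label to its value; return an empty dict for non-choice parameters."""
    # Guard first, since the choice methods raise an error on non-choice parameters
    if not param.is_choice_param():
        return {}
    return {choice['label']: choice['value'] for choice in param.get_choices()}


# Example, assuming params_list came from module.get_parameters():
# for param in params_list:
#     print(param.get_name(), choices_by_label(param))
```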

## Creating a Job Specification

In order to run a GenePattern job from Python, you must first obtain a GPJobSpec object from the correct GPTask object and then set the appropriate parameters for the job. For many parameters their default values will suffice. For others, you will want to set a specific value.

Below is code showing how to obtain a GPJobSpec object and how to iterate over the parameters, setting them to their default values.

In [None]:
# Create the GPJobSpec
job_spec = module.make_job_spec()

# Loop through all the parameters and set their default values
for param in module.get_parameters():  
    # If the parameter has a default value, set that value
    if param.get_default_value() is not None:
        # Set the default value
        job_spec.set_parameter( param.get_name(), param.get_default_value() )  
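
The same loop can be wrapped in a reusable helper that applies the defaults and then any values you want to override, rejecting names the module does not define so that typos surface early. This is a sketch built only on `get_name()`, `get_default_value()` and `set_parameter()`; the helper `fill_job_spec()` is hypothetical.

```python
def fill_job_spec(job_spec, params, overrides=None):
    """Set each parameter to its override if given, otherwise to its default value."""
    overrides = dict(overrides or {})
    for param in params:
        name = param.get_name()
        if name in overrides:
            job_spec.set_parameter(name, overrides.pop(name))
        elif param.get_default_value() is not None:
            job_spec.set_parameter(name, param.get_default_value())
    # Anything left over did not match a parameter of the module
    if overrides:
        raise KeyError("Unknown parameter(s): " + ", ".join(sorted(overrides)))
    return job_spec


# Example, assuming module and job_spec exist as in this tutorial:
# fill_job_spec(job_spec, module.get_parameters(),
#               {"input.filename": "https://datasets.genepattern.org/data/all_aml/all_aml_test.gct"})
```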

To set a specific value for a parameter, the set_parameter() method should be called. In the code below you will set the *input.filename* parameter to point to a publicly available dataset. This data should suffice for the purposes of this tutorial.

In [None]:
# Attach the input file to the correct parameter
job_spec.set_parameter("input.filename", "https://datasets.genepattern.org/data/all_aml/all_aml_test.gct") 

Data files can be uploaded by calling GPServer.upload_file(). This will return a GPFile object, and the parameter can be set to point to the URL of this object. An example for the PreprocessDataset module is shown below. The code has been commented out, however, as it will not be used in this tutorial.

In [None]:
# Upload the input file
# uploaded_file = gpserver.upload_file("file_name", "/path/to/the/file/on/the/file/system/file_name")  

# Attach the input file to the correct parameter
# job_spec.set_parameter("input.filename", uploaded_file.get_url())

## Submitting Your First Job

Once the GPJobSpec is ready, it can be used to launch a GenePattern job. This will return a GPJob object, representing the specific job that was just launched. A code example of how to do this is below.

In [None]:
# This will return the job object and continue execution even if the job isn't finished
job = gpserver.run_job(job_spec, False)

Why are we passing *False* in as a parameter, you ask? By default the run_job() method will halt code execution of your Python script until the job has finished running in GenePattern. For long running jobs, however, this may not be desirable. By optionally passing in *False* as a parameter, the method will return as soon as the job is submitted, allowing the Python program to continue.

For the purposes of this tutorial, it is better that you do not have to wait. If you did want to submit the job and wait for it to complete, however, the code is below (albeit commented out).

In [40]:
#  This will halt execution until the job is complete
# job = gpserver.run_job(job_spec)

### Querying for Job Status

When a GenePattern job is submitted, it passes through several states: pending, running and then either complete or error. At any time after a job has been submitted, its status can be checked by calling *get_status_message()*. Similarly, its completion can be checked by calling *is_finished()*. Examples of both are shown below.

In [None]:
# Prints a brief description of the job's current state
print( job.get_status_message() )

# Queries the server and returns True if the job is complete, False otherwise
print( job.is_finished() )

Finally, if at any point you decide that you just want to wait until the job is complete, you can always call *wait_until_done()*.

In [None]:
job.wait_until_done()
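
If you want to wait but not indefinitely, you can poll *is_finished()* yourself and give up after a deadline. The sketch below shows one way to do this; the helper `wait_with_timeout()` and its default intervals are assumptions made for this example, and only *is_finished()* comes from the library.

```python
import time


def wait_with_timeout(job, timeout_seconds=600, poll_seconds=5,
                      sleep=time.sleep, clock=time.time):
    """Poll job.is_finished() until it returns True or the timeout expires.

    Returns True if the job finished in time, False otherwise.
    The sleep and clock arguments exist so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_seconds
    while True:
        if job.is_finished():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


# Example, assuming job was returned by gpserver.run_job(job_spec, False):
# if not wait_with_timeout(job, timeout_seconds=300):
#     print("Job is still running; check back later")
```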

## Working with Output Files

Once the job is complete, and assuming there were no errors, a list of its output files may be obtained by making the *get_output_files()* call shown below. 

This will return a list of GPFile objects, each containing methods to download or read the contents of the file.

In [None]:
# Get a list of output files
output_list = job.get_output_files()  

for file in output_list:     # Loop through each output file
    print( file.get_url() )  # Print the URL to the file
    data = file.read()       # Read the data in the file 

Once the contents of a data file have been assigned to a variable, they may be used in conjunction with other common Python libraries, such as matplotlib, pandas, numpy or scipy.

The code below will print out the contents of the last output file assigned to the *data* variable in the previous code block.

In [None]:
print( data )
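
As an illustration of further analysis, the sketch below parses GCT-formatted text into sample names and per-gene expression values using only the standard library. It assumes that *data* holds the text of a GCT file: a `#1.x` version line, a line with the row and column counts, a header row, then one tab-separated row per gene. If *read()* returns bytes in your environment, decode them first. The helper `parse_gct()` is hypothetical and not part of the GenePattern library.

```python
import csv


def parse_gct(text):
    """Parse GCT text into (sample_names, {row_name: [values]})."""
    lines = text.splitlines()
    if len(lines) < 3 or not lines[0].startswith("#1."):
        raise ValueError("Not a GCT file: missing '#1.x' version line")
    # Second line holds the number of data rows and the number of samples
    n_rows, n_cols = (int(x) for x in lines[1].split()[:2])
    reader = csv.reader(lines[2:], delimiter="\t")
    header = next(reader)
    samples = header[2:]          # The first two columns are Name and Description
    matrix = {}
    for row in reader:
        if not row:
            continue              # Skip blank lines
        matrix[row[0]] = [float(value) for value in row[2:]]
    if len(matrix) != n_rows or len(samples) != n_cols:
        raise ValueError("Dimensions do not match the GCT header")
    return samples, matrix


# Example, assuming data was read from a GCT output file as above:
# samples, matrix = parse_gct(data)
# print(samples[:5])
```

Note that this sketch keys rows by name, so files with duplicate row names would fail the dimension check; a library such as pandas is a better fit for larger analyses.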

This concludes the tutorial on how to work with GenePattern using the GenePattern Python library. For more information, please see the [GenePattern Programmer's Guide](http://www.broadinstitute.org/cancer/software/genepattern/programmers-guide).