In [12]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.display(genepattern.session.register("https://cloud.genepattern.org/gp", "", ""))

GPAuthWidget()

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    1. Click <b>Run</b> in the cell below to activate Demo mode.
</div>


In [7]:
import nbtools
import demo_mode
from IPython.display import display, HTML

@nbtools.build_ui(parameters={'output_var': {'hide': True}})
def activate_demo_mode():
    """Enable demo mode for this notebook"""
    
    # Any job you enable as a demo job will need to be listed here, 
    # along with the module name and any parameters you want matched.
    # Parameters that are not listed will not be matched. You can 
    # list the same module multiple times, assuming you use different 
    # parameter match sets.
    
    demo_mode.set_demo_jobs([
        
        {
            'name': 'CollapseDataset',
            'job': 398423,
            'params': {
                'dataset.file':'all_genes_unversioned.gct',
                'chip.platform': 'ENSEMBL_human_gene.chip', 
                'collapse.mode':'Sum_of_probes',
                'output.file.name':'all_genes_hugo.gct',
            }
        },
        {
            'name': 'ConvertLineEndings',
            'job': 398422,   # Make sure to set the permissions of any demo job to 'public'
            'params': {
                'input.filename': 'all_aml_train.gct',
            }
        }
    ])


    # To activate demo mode, just call activate(). This example wraps 
    # the activation call behind a UI Builder cell, but you could have 
    # it called in different ways.

    demo_mode.activate()
    display(HTML('<div class="alert alert-success">Demo mode activated</div>'))
    
    # The code in this call has been left expanded for tutorial purposes

UIBuilder(description='Enable demo mode for this notebook', function_import='nbtools.tool(id="activate_demo_mo…

#  Part 1: Getting Data into GenePattern and GenePattern Notebook

In this notebook we are going to cover the different ways you can get your data into the GenePattern ecosystem so that it can be analyzed.  


## GenePattern, GenePattern Notebook and your computer

<img src="https://datasets.genepattern.org/images/Workshop%20Notebooks/GettingDataIntoGPNB/gp_ecosystem.png?a=b" width="700px">

In today's workshop you have been working in a web browser on your own computer.  You are probably used to the idea that when you are in a browser you are getting the web pages from another computer somewhere out on the internet.  For this notebook, the **GenePattern Notebook Workspace** (GPNB) at https://notebook.genepattern.org is the remote host providing the notebook.  When you are executing Python code and UIBuilder cells (with the dark grey title bars), this is the computer that is doing the actual work.

However, the GPNB server has limited CPU, its memory is shared between multiple users, and it runs synchronously (i.e. it ties up the notebook in your browser while jobs are running). For many analyses we therefore want to offload the work and let the GenePattern server run it instead. To handle big jobs and/or lots of jobs at the same time, the GenePattern server borrows a computer with the desired memory and number of CPUs from a high-performance computing environment (AWS Batch, the San Diego Supercomputer Center's Expanse system, etc.) and then sends the job there to be run.  When you are running GenePattern modules (with the blue title bars), this is what is happening.

The advantages of running analyses as GenePattern modules instead of in the notebook are:
- Supports many more simultaneous analyses
- Supports much larger available memory (up to 128+ GB)
- Supports many CPUs (up to 64+) for faster processing of parallel workloads
- Supports multiple programming languages (R, Python, Java, etc.)
- Supports 'special' processors such as GPUs for analyses that can take advantage of them

Also, since your browser is no longer "in the loop", once an analysis has started you can shut down your computer and come back hours or days later to get the results without worrying about time-outs or sessions closing due to inactivity.

The downside of this is that now we have an additional server to which we must send your data files.  In this section we will cover sending data to the GenePattern server (and not the GenePattern Notebook Server) to be analyzed by GenePattern analysis modules.  We will detail getting data into the notebook itself later in this notebook.



## Uploading a file from your computer

When we want to send a data file to the GenePattern server, we can do it directly from a module's analysis cell.  We do this using the "Upload File" button that is provided for GenePattern modules. 

For example, let's run a very simple module (ConvertLineEndings) below.  This analysis simply changes the line endings of a text file from Windows format to Linux format (don't worry if this means nothing to you; this is just an example module that runs quickly).
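
If you are curious what "changing line endings" means, the idea can be sketched in a few lines of Python. This is only an illustration of the concept, not the code the ConvertLineEndings module actually runs; the function name and the sample text are made up for this example.

```python
def to_unix_line_endings(text: str) -> str:
    """Replace Windows (CRLF) line endings with Linux/Unix (LF) line endings."""
    return text.replace("\r\n", "\n")


# A tiny tab-delimited text with Windows line endings (made-up content)
windows_text = "gene\tvalue\r\nGENE1\t12\r\n"

unix_text = to_unix_line_endings(windows_text)
print(repr(unix_text))
```

Windows ends each line with two characters (carriage return plus line feed) while Linux uses a single line feed, which is why some tools misread files created on the other system.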

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    <ol>
        <li>Click on the <b>"Upload File"</b> button beside the <b>"input filename *"</b> label.</li>
        <li>Select a small text file on your computer and hit the <b>"Open"</b> button.</li>
        <li>Click the <b>"Run"</b> button in the cell.</li>
        <li>When you are warned that this notebook is in demo mode you can choose either <b>"OK"</b> or <b>"Cancel"</b>.  Either is fine as we will not be using the outputs from this job.</li> 
    </ol>
    When this (or any) module's execution completes, it will add an output cell into the notebook with links to the job's output files.  
Don't worry about what to do with these files yet; we will look at some more interesting examples later in this notebook.
</div>


In [13]:
convertlineendings_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00002')
convertlineendings_job_spec = convertlineendings_task.make_job_spec()
convertlineendings_job_spec.set_parameter("input.filename", "")
convertlineendings_job_spec.set_parameter("output.file", "<input.filename_basename>.cvt.<input.filename_extension>")
convertlineendings_job_spec.set_parameter("job.memory", "2 Gb")
convertlineendings_job_spec.set_parameter("job.queue", "gp-cloud-default")
convertlineendings_job_spec.set_parameter("job.cpuCount", "1")
convertlineendings_job_spec.set_parameter("job.walltime", "02:00:00")
genepattern.display(convertlineendings_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00002')


When you hit "Run", your browser uploaded the file to the GenePattern server.  This bypassed the GenePattern Notebook server, which does not see the file at all.  Once the GenePattern server had the file, it ran the analysis (getting a compute node, putting the data file onto it, launching the analysis, making sure it completed and finally capturing the output files).  The GenePattern Notebook just sat on the sidelines watching this and, once the job was done, created the links to the output files that were generated.

Only module parameters that accept files will have the "Upload File" button present.  In the ConvertLineEndings module above, the "output file*" parameter does not have one because it only expects the name to use for the output file, not an actual file.

## Uploading a URL

As with uploading files, you can send the URL directly to the GenePattern server in a module's dialog.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    <ol>
        <li>Drag this link <code><a href="https://datasets.genepattern.org/data/all_aml/all_aml_train.gct">https://datasets.genepattern.org/data/all_aml/all_aml_train.gct</a></code> into the text area beside the <b>"input filename *"</b> label (it says "Add File or URL" when empty).</li>
        <li>Click the <b>"Run"</b> button in the cell.</li>
    </ol>
</div>

In [14]:
convertlineendings_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00002')
convertlineendings_job_spec = convertlineendings_task.make_job_spec()
convertlineendings_job_spec.set_parameter("input.filename", "")
convertlineendings_job_spec.set_parameter("output.file", "<input.filename_basename>.cvt.<input.filename_extension>")
convertlineendings_job_spec.set_parameter("job.memory", "2 Gb")
convertlineendings_job_spec.set_parameter("job.queue", "gp-cloud-default")
convertlineendings_job_spec.set_parameter("job.cpuCount", "1")
convertlineendings_job_spec.set_parameter("job.walltime", "02:00:00")
genepattern.display(convertlineendings_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00002')

In this case, when you hit "Run", things are slightly different than when you uploaded a file.  Instead of sending the file to GenePattern, your browser sends only the URL.  The GenePattern server gets a compute node as before and then downloads the file from that URL onto the compute node.  After that, everything proceeds as before.

<div class="alert alert-warning">
<p class="lead"> Important <i class="fa fa-exclamation-triangle"></i></p>
Any URLs you pass to GenePattern must be publicly accessible.  If a login is required to access it, the GenePattern compute node won't be able to read it.
</div>

## Uploading a file to the Notebook server

Remember that the GenePattern Notebook server is a different machine from the GenePattern server, so we upload to it slightly differently.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    <b>Right click</b> on the <b>"GenePattern Notebook" header</b> at the very top-right of this page and select <b>"Open in new tab"</b>.
</div>

Now, in the new tab (or window), you can see the main hub page. You should see a directory listing showing
* \<This Notebook\>.ipynb
* test_data/
* <i>\<possibly some other files\></i>
    
You can upload additional files in that window as well if you want to put them on the GPNB server.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    <ol>
        <li>Read these instructions to the end (you will be working in another window).</li>
        <li>In your browser, <b>Click</b> on the tab you just opened.</li>
        <li>Click on the <b>"Upload" button</b></li>
        <li>Select a file on your laptop and then click <b>"OK"</b></li>
        <li>On the row for your file, <b>Click</b> on the new <b>"Upload"</b> button at the right edge. </li>
    </ol>
</div>

This is the simplest method for getting files onto your notebook server.  Once a file is there you can use Python to read the contents into a variable.  If you are not a Python programmer, don't worry: you can do almost everything in GenePattern and GenePattern Notebook without programming.
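
As a minimal sketch of reading a file into a variable, the example below uses only the Python standard library. The file name `my_uploaded_file.txt` is hypothetical, and the example creates a small stand-in file in a temporary folder so that it runs on its own; in practice you would point `uploaded_path` at the file you uploaded.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as folder:
    # Hypothetical file name; stands in for a file you uploaded to the notebook server
    uploaded_path = Path(folder) / "my_uploaded_file.txt"
    uploaded_path.write_text("first line\nsecond line\n", encoding="utf-8")

    # Read the whole file into a single string variable
    contents = uploaded_path.read_text(encoding="utf-8")

# Split the text into a list with one entry per line
lines = contents.splitlines()
print(f"Read {len(lines)} lines")
```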

## Uploading a file from the Notebook to GenePattern

Now we will do one more transfer, this time sending a file from the notebook server to the GenePattern server.


<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
<ol>
    <li>Upload the file named <code>'test_data/all_aml_test.gct'</code> in the current Notebook folder by putting its name into the <b>"filename and path"</b> parameter in the first cell below.</li>
    <li><b>Click</b> "Run".</li>
</ol>
</div>

In [8]:
@genepattern.build_ui(name="Upload from Notebook server to GenePattern",collapse=False,description="Upload a file from the Notebook to the GenePattern server ", 
                      parameters={  "output_var":{"hide": True,},})
# Upload a file from the Notebook server to the GenePattern server so modules can use it
def uploadLocalNotebookFile(filename_and_path):
    # Use the last path component as the name of the uploaded file
    filename = filename_and_path.split("/")[-1]
    uploaded = genepattern.get_session(0).upload_file(filename, filename_and_path)
    uploaded_url = uploaded.get_url()
    display(genepattern.GPUIOutput(files=[uploaded_url]))

UIBuilder(collapse=False, description='Upload a file from the Notebook to the GenePattern server ', function_i…

What happened here is that the UIBuilder cell above (titled: "Upload from Notebook Server to GenePattern") pushed the file to the main GenePattern server itself.  Then it displayed the file's URL on the GenePattern server.  You can now use this as an input URL to a module as in the earlier example.

Links on the GenePattern server itself are the exception to the rule about publicly accessible links.  Normally these links (including links to output files from GenePattern modules) are only usable by the logged in GenePattern user who created them but since this is happening within GenePattern itself it can handle all of the security checks internally.

# GenePattern Output Files

At this point it's important to point out that the results from the analysis on the GenePattern server are still **ONLY** on the GenePattern server, and they are **NOT** in the notebook or on the notebook server.  To get the resulting data into the notebook we have to take an additional step.  Luckily the GPNB environment makes this very easy.  We can either use "Send to Code" to get a Python file object representing the data out on the GenePattern server or, for some data formats like GCT, use "Send to Dataframe", which pulls the data from the file on the GenePattern server into a pandas DataFrame object within the notebook.


### Send to DataFrame
For certain common GenePattern file formats, such as GCT and ODF, we go a step beyond generating links to output files. With these we also make it easy to import and index that data in a pandas DataFrame.

DataFrames are a structure used for many forms of tabular data in Python. They are provided by pandas, an extremely popular Python library for reading and working with large datasets, and come with a large number of built-in functions that make it easy to analyze and visualize your data. Pandas is a staple of most any sort of data science in Python.

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>
<ol><li>In the Job result below, Click on the 
<code>all_aml_train.cvt.gct</code>
output and select <b>Send to Dataframe.</b> This will create a code cell immediately below the job cell.</li></ol></div>


In [15]:
job387655 = gp.GPJob(genepattern.session.get(0), 387655)
genepattern.display(job387655)

GPJobWidget(job_number=387655)


<p>Python programmers can also use the "Send to Code" option. This will insert a snippet of code that accesses a GenePattern server result as a GPFile object that you can work with as if it were a local Python file object.</p>


# Part 2: GenePattern File Formats

For the exercises you have just done, we have been working with "GCT" files.  This is one of several common formats that you might encounter when working with GenePattern, GenePattern Notebook and GenePattern modules.

The GenePattern server and GenePattern Notebook do not themselves care at all about file formats, but the analyses that you run through them generally do.  GenePattern modules will expect a particular format (e.g. a GCT file) while Python functions in GenePattern Notebook may want a different format (e.g. hdf5) or a specific kind (class) of Python object (e.g. a pandas DataFrame).

The important thing to know is what is expected by the function or module that you want to use.  Usually this will be documented in the Notebook itself or the GenePattern module documentation.

Since we will be working with **gct** files in our analyses during this workshop, let's take a quick look at this file format.


## GCT File format

- Tab delimited
- Originally used for bulk Gene expression data but can be used for any matrix data
- Represents a Matrix
    - Columns = samples
    - Rows = features such as genes
    
<img src="https://datasets.genepattern.org/images/Workshop%20Notebooks/GettingDataIntoGPNB/gct_format.png?a=b" width="700px"></img>
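
To make the layout concrete, here is a minimal sketch that builds the text of a tiny GCT file by hand: a `#1.2` version line, a line with the number of rows and columns, a column header line starting with `Name` and `Description`, and then one tab-delimited line per feature. The helper `format_gct`, the sample names and the values are all made up for this illustration; later in this notebook the GenePattern Python library does the writing for you.

```python
def format_gct(sample_names, rows):
    """Build GCT-formatted text from sample names and (name, description, values) rows."""
    lines = [
        "#1.2",                                    # GCT version line
        f"{len(rows)}\t{len(sample_names)}",       # number of rows, number of samples
        "\t".join(["Name", "Description", *sample_names]),
    ]
    for name, description, values in rows:
        lines.append("\t".join([name, description, *(str(v) for v in values)]))
    return "\n".join(lines) + "\n"


# Made-up data for illustration only
sample_names = ["sample_A", "sample_B"]
rows = [
    ("GENE1", "first example gene", [10, 12]),
    ("GENE2", "second example gene", [0, 7]),
]

gct_text = format_gct(sample_names, rows)
print(gct_text)
```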

## File Format documentation

For details about these, and 26 of the other formats you will commonly encounter in GenePattern, you can refer to the <a href="https://www.genepattern.org/file-formats-guide">GenePattern File Formats Guide</a>.


GCT File Documentation:  <a href="https://www.genepattern.org/file-formats-guide#GCT">link</a>.


# Part 3: Getting your data into the GCT format

So given that many if not most GenePattern analysis modules like to work with data in this format, how do you get the data you already have ready to be used?


<div class="alert alert-warning">
<p class="lead"> Warning <i class="fa fa-exclamation-triangle"></i></p>
Do not use Microsoft Excel to manipulate your transcriptomic data.  It tends to automatically assign cell formats that can mess up your resulting file.
</div>

Getting your data into the gct format differs depending on what you are starting with.

If you already have a tab delimited file you might be able to simply edit it by adding the 2 header rows and the name and description columns.

If you are starting with RNASeq counts data, there are analysis modules that can create the gct file for you.

## Starting with a tab separated values (tsv) file

Often researchers will receive their data from a sequencing center in the form of a csv (comma separated values) or tsv (tab separated values) formatted text file.  Here we will take an example of such a file (provided by the UCSD Center for Computational Biology and Bioinformatics) and show you how to get it into gct format for processing in GenePattern.  This sample data is in the folder test_data/CCBB_Data in this project in a file called 'all_genes_results.txt'.

To convert this tab separated file, we will use Pandas to load it into a dataframe and set the appropriate index, then we can use the GenePattern Python library to write it as a gct file in this notebook's file system.  

We have wrapped the actual code here with a UIBuilder cell which you can copy into other notebooks should you want to reuse it.  You will need to know the column names in the tsv file that correspond to the gene name and description, and the suffix for the value you wish to use in the gct file (e.g. TPM, counts, FPKM).   

## Our example data:  test_data/CCBB_Data/all_genes_results.txt

Because of the large number of columns and the long column names, it's hard to view the head of this file in a browser since it gets too wide to display. Let's list the columns here vertically instead.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    <ol>
        <li>Set the <b>filenameAndPath</b> parameter to <code>"test_data/CCBB_Data/all_genes_results.txt"</code></li>
        <li>Click <b>'Run'</b>.</li>
    </ol>
</div>


In [9]:
import pandas as pd

@genepattern.build_ui(collapse=False, description="Display the column names from a text file.", 
                      parameters={  "output_var":{"hide": True,},})
def listColumnNames(filenameAndPath):
    df = pd.read_csv( filenameAndPath, sep='\t', header=0, index_col=False)
    display(list(df.columns.values))

UIBuilder(collapse=False, description='Display the column names from a text file.', function_import='nbtools.t…

So from this listing we can interpret the following details about the column names in this file:
<table style="margin-left:10px;">
<tr><td>Index column name:</td><td>gene_id</td></tr>
<tr><td>Description column name:</td><td>  transcript_id(s)</td></tr>
<tr><td>counts column suffix:</td><td>  expected_count</td></tr>
<tr><td>TPM column suffix:</td><td> TPM</td></tr>
<tr><td>FPKM column suffix:</td><td>  FPKM</td></tr>
</table>

Now that we know the column name patterns we can use them to select which columns we want to put into our gct file.  For this example, we will put the counts columns into our gct matrix.  We will use the Pandas python library to transform our text file into a gct file we can analyze with downstream tools, but as before we have wrapped it in a GenePattern UIBuilder cell so that you can use it without dealing directly with the code.
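
The conversion cell below selects columns by checking whether each column name ends with the suffix you enter. A minimal sketch of that idea, using plain Python strings, shows why the shorter suffix `"count"` still picks up the `expected_count` columns. The column names here are hypothetical stand-ins in the style of the listing above (the real names in your file will differ), and `columns_with_suffix` is a helper invented for this example.

```python
# Hypothetical column names, for illustration only
columns = [
    "gene_id",
    "transcript_id(s)",
    "sample1.expected_count",
    "sample1.TPM",
    "sample1.FPKM",
    "sample2.expected_count",
    "sample2.TPM",
    "sample2.FPKM",
]


def columns_with_suffix(names, suffix):
    """Return the names that end with the given suffix, keeping their order."""
    return [name for name in names if name.endswith(suffix)]


count_columns = columns_with_suffix(columns, "count")
tpm_columns = columns_with_suffix(columns, "TPM")
print(count_columns)
print(tpm_columns)
```

Because the match is on the end of the name, choose a suffix that is specific enough to select only one kind of value.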

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    Enter the following parameter values:
    <ul>
        <li>input filename: <code>"test_data/CCBB_Data/all_genes_results.txt"</code></li>
        <li>output filename: <code>"all_genes_results.gct"</code></li>
        <li>index column: <code>"gene_id"</code></li>
        <li>description_column: <code>"transcript_id(s)"</code></li>
        <li>values column suffix: <code>"count"</code></li>
    </ul>
    Click <b>'Run'</b>.
</div>

In [10]:
import pandas as pd
import gp.data as gpdata

@genepattern.build_ui(collapse=False, description="Upload gct to the server ", 
                      parameters={  "output_var":{"hide": True,},})
# convert a tab-separated text file on the Notebook server into a GCT file
def convertTabularDataIntoGCT(input_filename, output_filename, index_column, description_column, values_column_suffix):
    # read TSV into a data frame
    df = pd.read_csv(input_filename, sep='\t', header=0, index_col=False)
    # make sure the dataframe has a plain default row index
    df.reset_index(drop=True, inplace=True)
    # rename the index and description columns to the normal GenePattern conventions of Name and Description
    df.rename(columns={index_column: "Name", description_column: "Description"}, inplace=True)
    # Set Name and Description as the index, which is the layout a GCT dataframe expects
    df.set_index(["Name", "Description"], inplace=True)
    
    df2 = df.loc[:, df.columns.str.endswith(values_column_suffix)]
    # Show the top of the file
    display(df2.head())
    # Now write the dataframe out as a gct formatted file
    gpdata.write_gct(df2, output_filename)
    print("Wrote "+ output_filename)
    
    
    

UIBuilder(collapse=False, description='Upload gct to the server ', function_import='nbtools.tool(id="convertTa…

<h2 id="Replace-Ensembl-gene-ids-with-HUGO-symbols-and-remove-duplicates">Replace Ensembl gene ids with HUGO symbols and remove duplicates</h2>

<p>When you looked into the gct file that was output, you may have noticed that it uses Ensembl IDs to identify the rows. We would like to change these to HUGO symbols to make the data more human-friendly. When we do this, we may end up with multiple Ensembl IDs that map to a single HUGO symbol. However, since many of the analyses we will do later do not accept duplicate rows, we will want to collapse rows that map to the same symbol down to a single row.</p>

<p>To do this we will use the CollapseDataset module, which can collapse the rows and replace the Ensembl IDs with HUGO symbols in one step.  But before we can do that we first need to make one more adjustment to the data:</p>

<p>Our mapping file for the CollapseDataset module (https://datasets.genepattern.org/data/TCGA_HTSeq_counts/ENSEMBL_human_gene.chip) was created using unversioned Ensembl gene ids, but our data file has versioned ids, so once more we will use the Pandas library, this time to strip out the version numbers.</p>
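
A versioned Ensembl gene id has a version number after a dot, so stripping the version amounts to keeping the text before the first dot. The sketch below shows this on a few ids in plain Python; the version numbers are illustrative only, and `strip_ensembl_version` is a helper named for this example. The cell below applies the same `split(".")` idea to the `Name` column of the dataframe.

```python
def strip_ensembl_version(gene_id: str) -> str:
    """Keep only the part of an Ensembl id before the first dot."""
    return gene_id.split(".")[0]


# Example ids; the version suffixes are illustrative only
versioned_ids = ["ENSG00000141510.16", "ENSG00000012048.20", "ENSG00000139618"]

unversioned_ids = [strip_ensembl_version(gene_id) for gene_id in versioned_ids]
print(unversioned_ids)
```

An id that has no version suffix is returned unchanged, so it is safe to apply this to a mix of versioned and unversioned ids.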



<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    <ol>
        <li>For <b>"filename"</b>, enter <code>"all_genes_results.gct"</code> (the file we just created)</li>
        <li>For <b>"output filename"</b> enter <code>"all_genes_unversioned.gct"</code></li>
        <li>Click <b>"Run"</b></li>
    </ol>
</div>

In [11]:
@genepattern.build_ui(description="Strip version numbers from ensemble ids and uploads to the GenePattern server.", 
                      parameters={  "output_var":{"hide": True,},})
# strip the version number from each Ensembl gene id, write the result as a new gct file
# and upload that file to the GenePattern server so we can use it in the module
def stripEnsembleVersionIds(filename, output_filename):
    df = gpdata.GCT(filename)
    df.reset_index(drop=False, inplace=True)
    df['Name'] = df['Name'].apply(lambda x: x.split(".")[0])
    df.set_index(["Name", "Description"], inplace=True);
    display(df.head())
    gpdata.write_gct(df, output_filename)
    
    # and finally upload it to the GenePattern server from the Notebook server and display the resulting link
    uploaded = genepattern.get_session(0).upload_file(output_filename, output_filename)
    uploaded_url = uploaded.get_url()
    display(genepattern.GPUIOutput(files=[uploaded_url]))

UIBuilder(description='Strip version numbers from ensemble ids and uploads to the GenePattern server.', functi…

<p>Now we can take the resulting file and send it into the CollapseDataset Module to convert it to HUGO symbols and remove the duplicate  rows.</p>
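
To give a feel for what collapsing by sum means, here is a minimal sketch in plain Python. It is not the CollapseDataset implementation: the ids, symbols and counts are made up, `collapse_by_sum` is a helper invented for this example, and skipping ids that have no symbol is a choice made for this sketch.

```python
# Made-up mapping from gene ids to symbols, for illustration only
id_to_symbol = {"ID_A": "GENE1", "ID_B": "GENE1", "ID_C": "GENE2"}

# Made-up counts for two samples per gene id
counts_by_id = {
    "ID_A": [5, 1],
    "ID_B": [3, 4],
    "ID_C": [7, 0],
    "ID_D": [2, 2],  # has no symbol in the mapping
}


def collapse_by_sum(counts, mapping):
    """Sum the rows that map to the same symbol, sample by sample."""
    collapsed = {}
    for gene_id, values in counts.items():
        symbol = mapping.get(gene_id)
        if symbol is None:
            continue  # this sketch simply skips ids without a symbol
        if symbol in collapsed:
            collapsed[symbol] = [a + b for a, b in zip(collapsed[symbol], values)]
        else:
            collapsed[symbol] = list(values)
    return collapsed


collapsed_counts = collapse_by_sum(counts_by_id, id_to_symbol)
print(collapsed_counts)
```

The two ids that map to `GENE1` end up as a single row whose values are the per-sample sums, which is the behavior the `Sum_of_probes` collapse mode name suggests.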

<div class="alert alert-info">
<h3 style="margin-top: 0;"> Instructions <i class="fa fa-info-circle"></i></h3>

<ol>
    <li><b>Click</b> on the&nbsp;<b>dataset file</b>&nbsp;parameter below and select the&nbsp;<code>all_genes_unversioned.gct</code>&nbsp;result file from the&nbsp; cell result above.</li>
	<li>Leave the <strong>chip platform</strong> parameter as&nbsp;<code><A HREF="https://datasets.genepattern.org/data/TCGA_HTSeq_counts/ENSEMBL_human_gene.chip">https://datasets.genepattern.org/data/TCGA_HTSeq_counts/ENSEMBL_human_gene.chip</a></code> which should already be filled in.</li>
    <li>For the <b>collapse mode*</b> parameter select <code>Sum_of_probes</code></li>
    <li><b>Click</b> the &nbsp;<b>Run</b>&nbsp; button.</li>
    <li><b>Click</b> on the resulting gct file and select <b>"send to dataframe"</b> from the menu.</li>
</ol>
</div>

In [16]:
collapsedataset_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00134')
collapsedataset_job_spec = collapsedataset_task.make_job_spec()
collapsedataset_job_spec.set_parameter("dataset.file", "")
collapsedataset_job_spec.set_parameter("chip.platform", "https://datasets.genepattern.org/data/TCGA_HTSeq_counts/ENSEMBL_human_gene.chip")
collapsedataset_job_spec.set_parameter("collapse.mode", "Sum_of_probes")
collapsedataset_job_spec.set_parameter("output.file.name", "all_genes_hugo.gct")
collapsedataset_job_spec.set_parameter("omit.features.with.no.symbol.match", "true")
collapsedataset_job_spec.set_parameter("dev.mode", "false")
collapsedataset_job_spec.set_parameter("job.memory", "4 Gb")
collapsedataset_job_spec.set_parameter("job.queue", "gp-cloud-default")
collapsedataset_job_spec.set_parameter("job.cpuCount", "1")
collapsedataset_job_spec.set_parameter("job.walltime", "02:00:00")
genepattern.display(collapsedataset_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00134')

So now we have a gct file that is ready to be used in most of the downstream GenePattern analyses for clustering, Gene Set Enrichment Analysis (GSEA) etc.

### A word about what we just did

In the exercise we just did, we took 4 steps (cells) to convert the data into gct format.  Normally you will do most of these steps in a single cell and you would generate fewer intermediate files along the way. We have broken them out separately here for instructional purposes only.



# Part 4: Reusing Cells from Today's Notebooks

As an exercise we are going to briefly look at how you can reuse a cell from today's workshop notebooks in the future when you are working with your own data.

In this Notebook Project, at the top level there is a directory called **Reusable Notebook Cell Library** which contains a notebook of the same name.  We have put copies of many of the UIBuilder cells you have seen or will see today into it to make them easy to reuse in the future.

Note that since they use the UIBuilder, you will need to make sure that there is a GenePattern login cell present in the notebook where you want to use them.  Also, when you paste them into a new notebook, initially you will just see the code view.  You have to 'Run' the cell for the UI to appear.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    <ol>
        <li><b>Right click</b> on the <b>"GenePattern Notebook"</b> icon at the top of this page and select <b>"Open Link in New Tab"</b> (or new window)</li>
        <li><b>Click</b> on the <b>"Reusable Notebook Cell Library"</b> folder.</li>
        <li><b>Click</b> on the <b>"Reusable Notebook Cell Library.ipynb"</b> notebook.</li>
        <li><b>Login</b> to GenePattern using the defaults</li>
        <li><b>Scroll down</b> the page and select any UIBuilder cell.</li>
        <li> <b>Click</b> the gear icon at the top right and choose "toggle code view".</li>
        <li> <b>Copy</b> the cell contents.  (select all and then Ctrl-C or Cmd-C)</li>
        <li><b>Click</b> back into this notebook and insert a new cell (Insert menu --> Insert Cell Below) </li>
        <li><b>Paste</b> the copied code into the cell (Ctrl-V or Cmd-V).</li>
        <li><b>Run</b> the cell to display the UI</li>
    </ol>
</div>

# Appendix B CLS File Format

The cls file is used to tell analysis modules which class each sample belongs to.  For example, you may want to specify which samples are tumor and which are normal tissue to perform differential expression.

CLS File Documentation:  <a href="https://www.genepattern.org/file-formats-guide#CLS">link</a>.



- Tab or space delimited
- Used for defining labels or class membership of samples
- Values are numeric (0,1,…)
<img src="https://datasets.genepattern.org/images/Workshop%20Notebooks/GettingDataIntoGPNB/cls_format.png" width="700px"></img>
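
As a minimal sketch of the layout, a categorical cls file has three lines: the number of samples, the number of classes and the constant `1`; then a `#` followed by the class names; then one numeric label per sample, in the same order as the samples in the gct file. The helper `format_cls` and the five-sample example are made up for this illustration; see the File Formats Guide linked above for the full description.

```python
def format_cls(sample_classes, class_names):
    """Build cls-formatted text from one class name per sample."""
    # Each sample's label is the position of its class in class_names (0, 1, ...)
    labels = [str(class_names.index(name)) for name in sample_classes]
    header = f"{len(sample_classes)} {len(class_names)} 1"
    names_line = "# " + " ".join(class_names)
    return "\n".join([header, names_line, " ".join(labels)]) + "\n"


# Made-up example: two normal samples followed by three tumor samples
class_names = ["Normal", "Tumor"]
sample_classes = ["Normal", "Normal", "Tumor", "Tumor", "Tumor"]

cls_text = format_cls(sample_classes, class_names)
print(cls_text)
```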

### Creating a CLS file

While you may have already seen the simple cls file format, an easy way to generate the file is to use the ClsFileCreator module. This module will read your gct file and then allow you to name classes and specify which samples belong to which.

Let's try it now.

<div class="alert alert-info">
<p class="lead"> Instructions <i class="fa fa-info-circle"></i></p>
    <ol>
        <li>For <b>"input file"</b> select the <code>"all_genes_hugo.gct"</code> file you just generated with CollapseDataset.</li>
        <li>Click <b>"Run"</b></li>
        <li>On the first tab ("1. Samples") leave all samples checked and click <b>"Next"</b></li>
        <li>On the second tab ("2. Define Classes")
        <ol>
            <li>Enter <b>"Normal"</b> as a class name.</li>
            <li><b>Click the "+" button</b> to add the class.</li>
            <li>Enter <b>"Tumor"</b> as a class name. </li>
            <li><b>Click the "+" button</b> to add the class.</li>
            <li>Click <b>"Next"</b>.</li>
        </ol>
        </li>
         <li>On the third tab ("3. Assign Classes")
        <ol>
            <li><b>Click all the checkboxes</b> for the samples starting with <code>"P"</code> (you can use the filter to look for PN, PO and PC).</li>
            <li>Select class <b>"Normal"</b> on the right side.</li>
            <li><b>Click the arrow</b> to move the "P*" samples into the normal class. </li>
            <li><b>Click the class dropdown</b> to select the "Tumor" class.</li>
            <li><b>Click the checkboxes</b> for all the remaining samples (names starting with "C").</li>
            <li><b>Click the arrow</b> to move the "C*" samples into the Tumor class.</li>
            <li>Click <b>"Next"</b>.</li>
        </ol>
        </li>
        <li>Click <b>"Next"</b> to move through the Summary tab.</li> 
        <li>On tab 5, Save, click the <b>"View/Download file"</b> radio button, and then <b>"Save"</b>.</li> 
    </ol>
</div>

In [17]:
clsfilecreator_task = gp.GPTask(genepattern.session.get(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00261')
clsfilecreator_job_spec = clsfilecreator_task.make_job_spec()
clsfilecreator_job_spec.set_parameter("input.file", "")
clsfilecreator_job_spec.set_parameter("job.memory", "2 Gb")
clsfilecreator_job_spec.set_parameter("job.queue", "gp-cloud-default")
clsfilecreator_job_spec.set_parameter("job.cpuCount", "1")
clsfilecreator_job_spec.set_parameter("job.walltime", "02:00:00")
genepattern.display(clsfilecreator_task)

GPTaskWidget(lsid='urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00261')