# Making counts files for GSEA

In [22]:
import pandas as pd

Here we will import our clean counts file (CSV format) and convert it for GSEA. As you might remember from the GenePattern demo, we need to convert our counts and conditions files into compatible [formats](http://software.broadinstitute.org/cancer/software/genepattern/file-formats-guide) before we can use them. In our case, we need both a *GCT file* and a *CLS file*.


To do so, we will use some GenePattern tools. We will start with [MergeHTSeqCounts](http://software.broadinstitute.org/cancer/software/genepattern/modules/docs/MergeHTSeqCounts/1), which takes a modified version of our counts matrix (one tab-delimited file per sample) and merges the files into a single GCT file (details in the provided link).
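To make the target format concrete, here is a minimal sketch of how a GCT (version 1.2) file is laid out: a version line, a line with the number of genes and samples, a header row, and one row per gene. The helper `to_gct_text` and the toy table are hypothetical and only for illustration; in this notebook the real GCT file is produced by MergeHTSeqCounts. The sketch copies the gene name into the Description column, which mirrors what we see in the MergeHTSeqCounts output later on.

```python
import pandas as pd

def to_gct_text(counts: pd.DataFrame) -> str:
    """Render a genes x samples table as GCT v1.2 text (illustrative helper)."""
    n_genes, n_samples = counts.shape
    lines = ["#1.2", f"{n_genes}\t{n_samples}"]
    # Header row: two fixed columns followed by the sample names
    lines.append("\t".join(["Name", "Description"] + list(counts.columns)))
    for gene, row in counts.iterrows():
        values = [str(v) for v in row]
        # Description is filled with the gene name here (an assumption of this sketch)
        lines.append("\t".join([str(gene), str(gene)] + values))
    return "\n".join(lines) + "\n"

# Toy data: two genes, two samples
toy = pd.DataFrame(
    {"FTO_control_rep1": [170, 159], "FTO_shrna_rep1": [154, 126]},
    index=["WASH7P", "RP11-34P13.7"],
)
gct_text = to_gct_text(toy)
print(gct_text)
```

Printing `gct_text` is a quick way to see which parts of the file are fixed headers and which parts come from the counts table.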

Let's start by importing our final counts file, which was saved as a CSV file:

In [23]:
data_dir = "/home/ucsd-train01/projects/fto_shrna/deseq2/"

fto_counts = pd.read_csv(data_dir+"fto_counts_for_deseq2.csv",index_col=0,comment="#")

fto_counts.head()

Geneid,FTO_shrna_rep1,FTO_shrna_rep2,FTO_control_rep1,FTO_control_rep2
ENSG00000227232.4,154,257,170,183
ENSG00000238009.2,126,165,159,176
ENSG00000237683.5,773,1079,890,931
ENSG00000239906.1,28,32,46,47
ENSG00000241860.2,84,95,96,101


In [24]:
# Since the input of the GCT file uses gene names, let's import that information from a gencode
# annotation file with our gene names and merge with our dataframe
names_dir = "/oasis/tscc/scratch/biom200/fto_clip/"

gene_names = pd.read_table(names_dir+"gencode.v19.annotation.genenames.txt", index_col=0)

fto_counts = fto_counts.join(gene_names)
fto_counts.head()

Geneid,FTO_shrna_rep1,FTO_shrna_rep2,FTO_control_rep1,FTO_control_rep2,sym
ENSG00000227232.4,154,257,170,183,WASH7P
ENSG00000238009.2,126,165,159,176,RP11-34P13.7
ENSG00000237683.5,773,1079,890,931,AL627309.1
ENSG00000239906.1,28,32,46,47,RP11-34P13.14
ENSG00000241860.2,84,95,96,101,RP11-34P13.13
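Because `join` is a left join on the index, any gene ID that is absent from the annotation table ends up with a missing symbol. A minimal sketch of how one might check for that, using toy data (the ID `ENSG_hypothetical.1` is made up for illustration, and dropping unmatched rows is only one possible choice):

```python
import pandas as pd

# Toy counts table indexed by gene ID; the last ID is hypothetical
counts = pd.DataFrame(
    {"FTO_shrna_rep1": [154, 126, 5]},
    index=["ENSG00000227232.4", "ENSG00000238009.2", "ENSG_hypothetical.1"],
)
# Toy annotation table that lacks the hypothetical ID
names = pd.DataFrame(
    {"sym": ["WASH7P", "RP11-34P13.7"]},
    index=["ENSG00000227232.4", "ENSG00000238009.2"],
)

joined = counts.join(names)              # left join on the index
missing = joined[joined["sym"].isna()]   # rows that found no gene symbol
print(f"{len(missing)} gene(s) without a symbol")

# One option: drop unmatched genes before using the symbol as an index
joined_clean = joined.dropna(subset=["sym"])
```

If the count of missing symbols is zero on the real data, nothing needs to change in the steps that follow.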


In [25]:
# We will then go ahead and make the gene name column our (temporary) index
fto_counts = fto_counts.set_index(['sym'])
fto_counts.head()

sym,FTO_shrna_rep1,FTO_shrna_rep2,FTO_control_rep1,FTO_control_rep2
WASH7P,154,257,170,183
RP11-34P13.7,126,165,159,176
AL627309.1,773,1079,890,931
RP11-34P13.14,28,32,46,47
RP11-34P13.13,84,95,96,101
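Gene symbols are not guaranteed to be unique the way Ensembl IDs are, so after switching the index it can be worth checking for repeats. The sketch below uses toy data and sums the counts of rows that share a symbol. Summing is an assumption made here for illustration, not something this workflow requires, and the notebook itself does not collapse duplicates.

```python
import pandas as pd

# Toy table in which the symbol "Y_RNA" appears twice
toy = pd.DataFrame(
    {"FTO_shrna_rep1": [10, 5, 7], "FTO_control_rep1": [8, 4, 6]},
    index=pd.Index(["Y_RNA", "Y_RNA", "WASH7P"], name="sym"),
)

# Count rows whose symbol was already seen earlier in the index
n_dup = int(toy.index.duplicated().sum())
print(f"{n_dup} duplicated symbol row(s)")

# One possible policy: add up the counts of rows sharing a symbol
collapsed = toy.groupby(level="sym").sum()
print(collapsed)
```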


In [26]:
# Now we want to save each individual counts column as a separate file. The files will be
# tab-delimited text (.txt) and will NOT have a header. Because we eliminate the header,
# it is IMPORTANT to give each file a meaningful name:

FTO_shrna_rep1 = fto_counts['FTO_shrna_rep1']
FTO_shrna_rep1.head()

sym
WASH7P           154
RP11-34P13.7     126
AL627309.1       773
RP11-34P13.14     28
RP11-34P13.13     84
Name: FTO_shrna_rep1, dtype: int64

In [27]:
# We will now do this for the remainder of our samples 
FTO_shrna_rep2 = fto_counts['FTO_shrna_rep2']
FTO_control_rep1 = fto_counts['FTO_control_rep1']
FTO_control_rep2 = fto_counts['FTO_control_rep2']

In [28]:
# We now want to save these as independent text files with no header
FTO_shrna_rep1.to_csv(data_dir+"FTO_shrna_rep1.txt",header=False,sep="\t")

In [29]:
# What does our file look like now? Let's take a quick look:

FTO_shrna_r1_tab = pd.read_table(data_dir+"FTO_shrna_rep1.txt",header=None)
FTO_shrna_r1_tab.head()

,0,1
0,WASH7P,154
1,RP11-34P13.7,126
2,AL627309.1,773
3,RP11-34P13.14,28
4,RP11-34P13.13,84


In [30]:
# It looks like it fits the format that we want (tab-delimited text file, first column contains
# the gene symbol and the second column specifies the read count). Let's repeat for our other
# columns.

FTO_shrna_rep2.to_csv(data_dir+"FTO_shrna_rep2.txt",header=False,sep="\t")
FTO_control_rep1.to_csv(data_dir+"FTO_control_rep1.txt",header=False,sep="\t")
FTO_control_rep2.to_csv(data_dir+"FTO_control_rep2.txt",header=False,sep="\t")
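With more samples, writing each column by hand gets repetitive. The same step can be sketched as a loop over the columns, naming each file after its sample. This example uses a toy table and a temporary directory (`out_dir` is a stand-in for `data_dir`), so it can be run without touching the real files.

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for fto_counts, indexed by gene symbol
toy = pd.DataFrame(
    {"FTO_shrna_rep1": [154, 126], "FTO_control_rep1": [170, 159]},
    index=pd.Index(["WASH7P", "RP11-34P13.7"], name="sym"),
)

out_dir = tempfile.mkdtemp()  # stand-in for data_dir
written = []
for sample in toy.columns:
    path = os.path.join(out_dir, f"{sample}.txt")
    # One headerless, tab-delimited file per sample: gene symbol, then count
    toy[sample].to_csv(path, sep="\t", header=False)
    written.append(path)

print(written)
```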

I'm now going to launch a GenePattern login session using the *Tools* icon located above:

In [13]:
# Requires GenePattern Notebook: pip install genepattern-notebook
import gp
import genepattern

# Username and password removed for security reasons.
genepattern.GPAuthWidget(genepattern.register_session("https://genepattern.broadinstitute.org/gp", "", ""))

**Making GCT files**

Once logged in, I will open the program MergeHTSeqCounts using the tools icon again:

Make sure to upload ALL of your files into the input files section in the order that you would like them to be in the final dataframe. I chose [control_1, control_2, shrna_1, shrna_2]. Use sshfs to do this, then press *Run*:

In [18]:
mergehtseqcounts_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00354')
mergehtseqcounts_job_spec = mergehtseqcounts_task.make_job_spec()
mergehtseqcounts_job_spec.set_parameter("input.files", ["https://genepattern.broadinstitute.org/gp/users/rmarina%40ucsd.edu/tmp/run9117143890045340230.tmp/FTO_control_rep1.txt", "https://genepattern.broadinstitute.org/gp/users/rmarina%40ucsd.edu/tmp/run7772597350920226555.tmp/FTO_control_rep2.txt", "https://genepattern.broadinstitute.org/gp/users/rmarina%40ucsd.edu/tmp/run4036849629561514713.tmp/FTO_shrna_rep1.txt", "https://genepattern.broadinstitute.org/gp/users/rmarina%40ucsd.edu/tmp/run7783167360217589325.tmp/FTO_shrna_rep2.txt"])
mergehtseqcounts_job_spec.set_parameter("output.prefix", "FTO_shRNA_k562")
genepattern.GPTaskWidget(mergehtseqcounts_task)

In [20]:
job1557259 = gp.GPJob(genepattern.get_session(0), 1557259)
genepattern.GPJobWidget(job1557259)

**Making CLS files**

Download the output file and take a look at it with a text editor on your local computer. Does it look the way a GCT file is supposed to? If so, we can move to the next stage, which involves creating our CLS file from our GCT file. The CLS file defines the conditions of our experiment.

- Move the output file of MergeHTSeqCounts to the ClsFileCreator input file area and run.
- Check all of your samples and assign them a class (e.g. knockdown, control).
- Assign your samples to their respective classes with the arrow icons. Classes can be controlled with the pulldown tab on the right.
- Once everything looks in order, review and save.
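For reference, a categorical CLS file is only three lines long: the number of samples, the number of classes and a `1`; a `#` followed by the class names; and one class index per sample, in the same order as the GCT columns. A minimal sketch of that layout, assuming the sample order [control_1, control_2, shrna_1, shrna_2] chosen above (the helper `to_cls_text` is hypothetical and is not part of ClsFileCreator):

```python
def to_cls_text(labels):
    """Render per-sample class labels as categorical CLS text (illustrative helper)."""
    # Unique class names in order of first appearance; names must not contain spaces
    classes = list(dict.fromkeys(labels))
    codes = [str(classes.index(label)) for label in labels]
    return "\n".join([
        f"{len(labels)} {len(classes)} 1",  # samples, classes, and a fixed 1
        "# " + " ".join(classes),          # class names
        " ".join(codes),                   # one class index per sample
    ]) + "\n"

cls_text = to_cls_text(["control", "control", "knockdown", "knockdown"])
print(cls_text)
```

Comparing this with the file saved by ClsFileCreator is a quick way to confirm that every sample landed in the intended class.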

In [14]:
clsfilecreator_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.visualizer:00261')
clsfilecreator_job_spec = clsfilecreator_task.make_job_spec()
clsfilecreator_job_spec.set_parameter("input.file", "https://genepattern.broadinstitute.org/gp/jobResults/1556971/FTO_shRNA_k562.gct")
genepattern.GPTaskWidget(clsfilecreator_task)

In [15]:
job1556972 = gp.GPJob(genepattern.get_session(0), 1556972)
genepattern.GPJobWidget(job1556972)

**Modifying GCT file (getting rid of redundant column)**

We are ALMOST ready to do our GSEA analysis. We have one more thing to do before we are ready: get rid of the contents of an extra column in our GCT file. To do this we will:

1.) Download our GCT file to our local computer. It should be in a tab-delimited GCT file format that can be opened with a text editor.
2.) Change the file extension on the .gct file to .txt. This will ensure Excel will recognize and open our file.
3.) Open Microsoft Excel and open a new workbook. Go to File, select Import and choose the txt file. This will split the file into cells that can then be edited in Excel.
4.) Go to the Description column. Notice that these cells are the same as the Name column. Keeping the header ("Description"), clear the contents of the remaining cells below. This will ensure that our Description column is empty.
5.) "Save As" a tab-delimited text file (.txt). 
6.) Rename this file as a .gct file, and make sure to get rid of any .txt extensions. GSEA will only recognize files that have a .gct extension.
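If you would rather not go through Excel, the same cleanup can be sketched in a few lines of Python: keep the three GCT header lines as they are and empty the second field of every data row. The function name `blank_gct_description` and the example text are hypothetical, and the sketch assumes a well-formed tab-delimited GCT file whose second column is the description.

```python
def blank_gct_description(gct_text: str) -> str:
    """Return GCT text with the second (description) field emptied in every data row."""
    lines = gct_text.splitlines()
    header, rows = lines[:3], lines[3:]  # version line, dimensions line, column header
    cleaned = []
    for row in rows:
        if not row:
            continue  # skip blank lines
        fields = row.split("\t")
        fields[1] = ""  # clear the description, keep the column itself
        cleaned.append("\t".join(fields))
    return "\n".join(header + cleaned) + "\n"

# Toy GCT content with two genes and two samples
example = (
    "#1.2\n"
    "2\t2\n"
    "Name\tDescription\tFTO_control_rep1\tFTO_shrna_rep1\n"
    "WASH7P\tWASH7P\t170\t154\n"
    "RP11-34P13.7\tRP11-34P13.7\t159\t126\n"
)
cleaned_text = blank_gct_description(example)
print(cleaned_text)
```

To apply it to a real file, read the downloaded .gct into a string, pass it through the function, and write the result to a new file that keeps the .gct extension.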

Now we should have everything we need for GSEA analysis:

In [16]:
gsea_task = gp.GPTask(genepattern.get_session(0), 'urn:lsid:broad.mit.edu:cancer.software.genepattern.module.analysis:00072')
gsea_job_spec = gsea_task.make_job_spec()
gsea_job_spec.set_parameter("expression.dataset", "https://genepattern.broadinstitute.org/gp/users/rmarina%40ucsd.edu/tmp/run8238873762206304373.tmp/FTO_counts_for_GSEA.gct")
gsea_job_spec.set_parameter("gene.sets.database", "h.all.v6.0.symbols.gmt")
gsea_job_spec.set_parameter("gene.sets.database.file", "")
gsea_job_spec.set_parameter("number.of.permutations", "1000")
gsea_job_spec.set_parameter("phenotype.labels", "https://genepattern.broadinstitute.org/gp/users/rmarina%40ucsd.edu/tmp/run4525946911680508477.tmp/FTO_shRNA_k562%20%281%29.cls")
gsea_job_spec.set_parameter("target.profile", "")
gsea_job_spec.set_parameter("collapse.dataset", "false")
gsea_job_spec.set_parameter("permutation.type", "gene_set")
gsea_job_spec.set_parameter("chip.platform.file", "")
gsea_job_spec.set_parameter("scoring.scheme", "weighted")
gsea_job_spec.set_parameter("metric.for.ranking.genes", "Signal2Noise")
gsea_job_spec.set_parameter("gene.list.sorting.mode", "real")
gsea_job_spec.set_parameter("gene.list.ordering.mode", "descending")
gsea_job_spec.set_parameter("max.gene.set.size", "500")
gsea_job_spec.set_parameter("min.gene.set.size", "15")
gsea_job_spec.set_parameter("collapsing.mode.for.probe.sets.with.more.than.one.match", "Max_probe")
gsea_job_spec.set_parameter("normalization.mode", "meandiv")
gsea_job_spec.set_parameter("randomization.mode", "no_balance")
gsea_job_spec.set_parameter("omit.features.with.no.symbol.match", "true")
gsea_job_spec.set_parameter("make.detailed.gene.set.report", "true")
gsea_job_spec.set_parameter("median.for.class.metrics", "false")
gsea_job_spec.set_parameter("number.of.markers", "100")
gsea_job_spec.set_parameter("plot.graphs.for.the.top.sets.of.each.phenotype", "20")
gsea_job_spec.set_parameter("random.seed", "timestamp")
gsea_job_spec.set_parameter("save.random.ranked.lists", "false")
gsea_job_spec.set_parameter("output.file.name", "<expression.dataset_basename>.zip")
genepattern.GPTaskWidget(gsea_task)

In [21]:
job1557266 = gp.GPJob(genepattern.get_session(0), 1557266)
genepattern.GPJobWidget(job1557266)