# Create Gene Signature Matrix Overview
```
Andrew E. Davidson
aedavids@ucsc.edu
9/8/22
```
The following code is a "poor man's" unit test demonstrating how to create the signature matrix. For each type/category (GTEx tissue id or TCGA cohort), we want to calculate the average value for each gene. The pandas code is a little tricky.

1) data pipeline  
* a. We used Salmon to create transcript counts for the GTEx and TCGA data sets.
* b. These counts were combined into a single matrix.
* c. We grouped the counts by geneId.
* d. We split the matrix into 60/20/20 train, validate, and test data sets.
* e. We ran [1vsAllTask.wdl](https://portal.firecloud.org/?return=terra#methods/aedavids.ucsc.edu/1vsAllTask.wdl/10) on GTEx_TCGA_TrainGroupby.csv.

The results can be found at [GTEx_TCGA_1vsAll](https://app.terra.bio/#workspaces/test-aedavids-proj/uber/data). The result files are the output of DESeq2. They contain log2 fold changes, not count data.

2) we ran terra/jupyterNotebooks/signatureGenesUpsetPlots.ipynb on the 1vsAll results. It filters sets of candidate signature genes and creates upset plots. The sets include
* top 25 up regulated genes
* best 25 genes
* top 25 down regulated genes

For each of the candidate signature gene sets, we save the 1vsAll DESeq results.

**Some things to be aware of**
The GTEx_TCGA data set has a total of 83 categorical levels (types). Our "best 25" candidate gene filter selects about 831 unique genes, not the 2075 you might expect (83 * 25 = 2075). This is because of the way our 1vsAll model works. For each categorical level and for each gene, we calculate the ratio of the mean count for the level to the mean count for all other levels. A single gene may be up regulated in multiple levels. Our goal is to find the smallest set of signature genes.
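
The overlap effect can be reproduced on a toy example. The sketch below is only an illustration: the counts, sample ids, and category labels are made up, and a plain ratio of means stands in for the DESeq2 fold change. It picks the top gene for each level and shows that the union is smaller than (number of levels * top N), because the same gene wins in two levels.

```python
import pandas as pd

# made-up counts: rows are samples, columns are genes
countsDF = pd.DataFrame(
    {"g1": [100.0, 90.0, 80.0, 1.0],
     "g2": [1.0, 1.0, 2.0, 50.0],
     "g3": [5.0, 5.0, 5.0, 5.0]},
    index=["s1", "s2", "s3", "s4"])

# made-up category (level) for each sample
category = pd.Series(["c1", "c1", "c2", "c3"], index=countsDF.index)

topN = 1
selected = {}
for level in sorted(category.unique()):
    inLevel = category == level
    # mean count in this level divided by mean count in all other levels
    ratio = countsDF[inLevel].mean() / countsDF[~inLevel].mean()
    selected[level] = set(ratio.nlargest(topN).index)

# g1 is the top gene for both c1 and c2, so the union has 2 genes, not 3
uniqueGenes = set().union(*selected.values())
print(selected)
print(len(uniqueGenes), "unique genes from", len(selected) * topN, "picks")
```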

Our initial selection criteria are naive. Based on the performance of CiberSort, we can use more sophisticated selection criteria, e.g. select genes that are unique to a specific level; if there are levels that do not have any genes, add in genes that are in only 2 levels, 3 levels, ...
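
One possible reading of that refinement is sketched below. The function name `selectByLevelCount`, its arguments, and the toy candidate sets are hypothetical and not part of the pipeline. For each level it first keeps the genes that appear in that level only; any level that is still empty then falls back to genes shared by exactly 2 levels, then 3, and so on.

```python
from collections import Counter

def selectByLevelCount(candidates, maxLevels):
    """
    candidates: dict mapping level -> set of candidate gene ids
    maxLevels: largest number of levels a selected gene may be shared by
    returns: dict mapping level -> set of selected gene ids
    """
    # number of levels each gene is a candidate in
    levelCount = Counter(g for genes in candidates.values() for g in genes)

    selected = {level: set() for level in candidates}
    for k in range(1, maxLevels + 1):
        for level, genes in candidates.items():
            # only relax the criteria for levels that have no genes yet
            if not selected[level]:
                selected[level] = {g for g in genes if levelCount[g] == k}
    return selected

# toy input: g2 is a candidate in every level, g3 in two levels
candidates = {
    "c1": {"g1", "g2"},
    "c2": {"g2", "g3", "g4"},
    "c3": {"g2", "g3"},
}
selected = selectByLevelCount(candidates, maxLevels=3)
print(selected)
```

In this toy input c1 and c2 each have a unique gene, while c3 has none and therefore picks up g3, which it shares with one other level.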

3) to run cibersort we calculate signature gene profiles in "gene transcript count" space.

In [1]:
import pandas as pd

# use display() to print an html version of a data frame
# useful if the dataFrame output is not generated by the last line of the cell
from IPython.display import display

## Create Mock GroupByGene Count Data

```
$ head GTEx_TCGA_TrainGroupby.csv |cut -d , -f 1-3
geneId,GTEX-1117F-0226-SM-5GZZ7,GTEX-1117F-0526-SM-5EGHJ
(A)n,9,1
(AAA)n,0,0
(AAAAAAC)n,0,0
(AAAAAAG)n,0,0
(AAAAAAT)n,0,0
(AAAAAC)n,0,0
(AAAAACA)n,0,0
(AAAAACC)n,0,0
(AAAAACT)n,0,0
```

In [2]:
# assumes we have already selected the signature genes
# s1 and s1b data are designed to make it easy to test
# if we calculate the expected values correctly
groupByGeneDF = pd.DataFrame( {
    'geneId':['g1', 'g2', 'g3', 'g4'] ,   
    "s1" :[  1,   2,   3,   4],
    "s1b":[1.5, 2.5, 3.5, 4.5],    
    "s2" :[ 10,  20,  30,  40],
    "s3" :[100, 200, 300, 400]
})

# set index to geneId; this will make the join easier. When we transpose
# the data frame the index will become the column names
groupByGeneDF = groupByGeneDF.set_index('geneId')

groupByGeneDF

geneId,s1,s1b,s2,s3
g1,1,1.5,10,100
g2,2,2.5,20,200
g3,3,3.5,30,300
g4,4,4.5,40,400


## Create the Mock ColData
This is the metadata required by DESeq.

```
$ head GTEx_TCGA_TrainColData.csv
sample_id,participant_id,category,gender,age,dataSet
GTEX-1117F-0226-SM-5GZZ7,GTEX-1117F,Adipose_Subcutaneous,Female,66.0,GTEx
GTEX-1117F-0526-SM-5EGHJ,GTEX-1117F,Artery_Tibial,Female,66.0,GTEx
GTEX-1117F-0726-SM-5GIEN,GTEX-1117F,Heart_Atrial_Appendage,Female,66.0,GTEx
GTEX-1117F-2826-SM-5GZXL,GTEX-1117F,Breast_Mammary_Tissue,Female,66.0,GTEx
GTEX-1117F-3226-SM-5N9CT,GTEX-1117F,Brain_Cortex,Female,66.0,GTEx
GTEX-111CU-0326-SM-5GZXO,GTEX-111CU,Lung,Male,57.0,GTEx
GTEX-111CU-0426-SM-5GZY1,GTEX-111CU,Spleen,Male,57.0,GTEx
GTEX-111CU-0526-SM-5EGHK,GTEX-111CU,Pancreas,Male,57.0,GTEx
GTEX-111CU-0626-SM-5EGHL,GTEX-111CU,Esophagus_Muscularis,Male,57.0,GTEx
```

In [3]:
colDataDF = pd.DataFrame( {
    'sampleId':['s1', 's1b', 's2',   's3'] ,   
    "category":[ 'c1', 'c1', 'c2',   'c3']
})
colDataDF

,sampleId,category
0,s1,c1
1,s1b,c1
2,s2,c2
3,s3,c3


## Transpose the count data so that we can join the col data

In [4]:
# copy so we do not accidentally change the original groupByGeneDF
transposeGroupByDF = groupByGeneDF.transpose(copy=True)
display(transposeGroupByDF)

geneId,g1,g2,g3,g4
s1,1.0,2.0,3.0,4.0
s1b,1.5,2.5,3.5,4.5
s2,10.0,20.0,30.0,40.0
s3,100.0,200.0,300.0,400.0


## Scale the counts
The estimated scaling factors were generated by DESeq. The mock values below stand in for them.

In [5]:
scaleDF = pd.DataFrame( [1, 2, 3, 4])
display(scaleDF)
# element wise multiplication. use .values so the (4, 1) array broadcasts
# across the gene columns: each sample (row) is multiplied by its scaling factor
normalizedDF = transposeGroupByDF * scaleDF.values

normalizedDF

,0
0,1
1,2
2,3
3,4


geneId,g1,g2,g3,g4
s1,1.0,2.0,3.0,4.0
s1b,3.0,5.0,7.0,9.0
s2,30.0,60.0,90.0,120.0
s3,400.0,800.0,1200.0,1600.0
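
The `.values` call matters here: multiplying two data frames directly aligns on both row and column labels, and because `scaleDF` has labels 0..3 and a single column `0`, none of them match the sample ids or gene ids, so the result would be all NaN. A label-safe alternative, sketched below under the assumption that the scaling factors can be keyed by sample id (the `scaleSeries` name and its values are made up), is `DataFrame.mul()` with `axis=0`.

```python
import pandas as pd

# samples x genes, same mock values as the first two genes above
countsDF = pd.DataFrame(
    {"g1": [1.0, 1.5, 10.0, 100.0],
     "g2": [2.0, 2.5, 20.0, 200.0]},
    index=["s1", "s1b", "s2", "s3"])

# hypothetical per-sample scaling factors, indexed by sample id
scaleSeries = pd.Series([1, 2, 3, 4], index=["s1", "s1b", "s2", "s3"])

# axis=0 matches the Series index against the row (sample) labels,
# so every row is multiplied by the factor for that sample
scaledDF = countsDF.mul(scaleSeries, axis=0)
print(scaledDF)
```

Matching on labels rather than on position guards against the factors and the count rows being in a different order.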


## Join the data frames

In [6]:
joinDF =  pd.merge(left=normalizedDF, 
                right=colDataDF, 
                how='inner', 
                left_index=True, #left_on="index",
                right_on="sampleId")


display(joinDF)

,g1,g2,g3,g4,sampleId,category
0,1.0,2.0,3.0,4.0,s1,c1
1,3.0,5.0,7.0,9.0,s1b,c1
2,30.0,60.0,90.0,120.0,s2,c2
3,400.0,800.0,1200.0,1600.0,s3,c3
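
An inner join silently drops any sample whose id has no matching row in the col data. On the real data it may be worth checking for such samples before joining. The following is a minimal sketch with made-up frames in which `s2` is deliberately missing from the col data.

```python
import pandas as pd

# mock scaled counts, indexed by sample id
normalizedDF = pd.DataFrame(
    {"g1": [1.0, 3.0, 30.0]},
    index=["s1", "s1b", "s2"])

# mock col data; s2 has no row here
colDataDF = pd.DataFrame({
    "sampleId": ["s1", "s1b"],
    "category": ["c1", "c1"],
})

# sample ids that an inner join would drop
missing = sorted(set(normalizedDF.index) - set(colDataDF["sampleId"]))
print("samples without col data:", missing)
```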


## Calculate the expected Values

In [7]:
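# numeric_only=True keeps the string column sampleId out of the mean;
# recent pandas versions raise a TypeError otherwise
signatureDF = joinDF.groupby("category").mean(numeric_only=True)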
display(signatureDF)

category,g1,g2,g3,g4
c1,2.0,3.5,5.0,6.5
c2,30.0,60.0,90.0,120.0
c3,400.0,800.0,1200.0,1600.0
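
Since the mock data was designed so that the expected values are easy to work out by hand, the visual check can be turned into an assertion. The sketch below rebuilds a two-gene version of the joined frame so that it runs on its own; `expectedDF` holds the hand-calculated means (for c1, the mean of samples s1 and s1b).

```python
import pandas as pd
from pandas.testing import assert_frame_equal

# two-gene version of the joined frame shown above
joinDF = pd.DataFrame({
    "g1": [1.0, 3.0, 30.0, 400.0],
    "g2": [2.0, 5.0, 60.0, 800.0],
    "sampleId": ["s1", "s1b", "s2", "s3"],
    "category": ["c1", "c1", "c2", "c3"],
})

signatureDF = joinDF.groupby("category").mean(numeric_only=True)

# hand-calculated: c1 g1 = (1 + 3) / 2 = 2.0, c1 g2 = (2 + 5) / 2 = 3.5
expectedDF = pd.DataFrame(
    {"g1": [2.0, 30.0, 400.0],
     "g2": [3.5, 60.0, 800.0]},
    index=pd.Index(["c1", "c2", "c3"], name="category"))

# raises AssertionError if the values, labels, or dtypes differ
assert_frame_equal(signatureDF, expectedDF)
```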


## Convert to CiberSort upload format
The math requires the signature matrix to have dimensions (categories x genes). The upload format is expected to have dimensions (genes x categories). CiberSort also requires that we set the column name for the geneId values to 'name'.

In [8]:
#signatureDF.index.name = "name"
ciberSortSignatueDF = signatureDF.transpose()
ciberSortSignatueDF.index.name = "name"
ciberSortSignatueDF

name,c1,c2,c3
g1,2.0,30.0,400.0
g2,3.5,60.0,800.0
g3,5.0,90.0,1200.0
g4,6.5,120.0,1600.0


In [13]:
testFile = "AEDWIP.csv"
# -f: do not complain if the file does not exist yet
!rm -f $testFile
ciberSortSignatueDF.to_csv(testFile)
with open(testFile) as fp:
    lines = fp.readlines()
    for l in lines:
        print(l.strip())

name,c1,c2,c3
g1,2.0,30.0,400.0
g2,3.5,60.0,800.0
g3,5.0,90.0,1200.0
g4,6.5,120.0,1600.0
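
As a final check, the saved text can be parsed back into a data frame to confirm the layout (genes x categories, with the gene id column named 'name'). The sketch below parses the same text from an in-memory string instead of the file so that it is self-contained.

```python
import io
import pandas as pd

# same content as the file printed above
csvText = """name,c1,c2,c3
g1,2.0,30.0,400.0
g2,3.5,60.0,800.0
g3,5.0,90.0,1200.0
g4,6.5,120.0,1600.0
"""

readBackDF = pd.read_csv(io.StringIO(csvText), index_col="name")

# rows are genes, columns are categories
print(readBackDF.shape)
print(readBackDF.loc["g2", "c1"])
```

If the upload turns out to need a tab-delimited file (an assumption worth checking against the CiberSort documentation), `to_csv()` accepts `sep="\t"`.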
