# Find Candidate Enrichment Biomarkers
Andrew E. Davidson  
aedavids@ucsc.edu   
5/16/24  

Copyright (c) 2020-2023, Regents of the University of California All rights reserved. https://polyformproject.org/licenses/noncommercial/1.0.0

ref: 
- deconvolutionAnalysis/doc/addDegree2Genes.md
- deconvolutionAnalysis/doc/bestCuratedNotes.md
- intraExtraRNA_POC/adenocarcinoma.vs.control/enrichESCA.ipynb
- deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/findCandidateEnrichmentBiomarkers.ipynb


**<span style="color:red;background-color:yellow">TODO</span>**  
- check for unused degree 1 genes
- check if degree 1 genes are down regulated. Vikas and Daniel think it is surprising to find down regulated genes in cancer. Do we see this down regulation in other cancers?

----
# **<span style="color:red;background-color:yellow">Bone Yard</span>**

comments below are buggy
- best10CuratedDegree1_ce467ff was sorted in ascending order
- do not use "best500LFC_FindAllDegree1_wl500"
- use "best500FindAllDegree1_wl500"

<span style="color:red;background-color:yellow">('ESCA', 'STAD') : len(v) = 167</span>  
STAD is 'stomach adenocarcinoma'. Follow up: I am not sure if this is known biology or a hypothesis: "acid reflux causes stomach cells to become cancerous esophagus cells"

**top 2 candidate genes based on differential expression**  
ANKRD36C, FGF19, AL031708.1

**misclassification error metrics**  
see deconvolutionAnalysis/doc/addDegree2Genes.md

## Abstract
Goal: improve 
1. GTEx_TCGA deconvolution for ESCA, STAD, Esophagus_Mucosa, and Stomach. 
2. elife random forest hyperparameter tuning results
3. nanoporeAdenocarcinomaBinaryClassification.ipynb

see deconvolutionAnalysis/doc/addDegree2Genes.md for description

**overview** 
1. check for unused degree 1 genes
2. find degree 2 ESCA Genes
3. ignore STAD and Esophagus_Mucosa. These cancer and tissue types are probably similar to ESCA, STAD, and Stomach
4. for each gene, compare the differential expression results with the other class in the degree 2 intersection. Avoid genes with marginally different expression
5. explore classification errors. See deconvolutionAnalysis/doc/addDegree2Genes.md. Avoid adding genes for a class with only a small number of misclassification errors

# <span style="color:red;background-color:yellow">Bug?</span>
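There may have been a bug in BestCuratedGeneConfig: findGenes returned sort_values(by="baseMean"), which sorts in ascending order by default. I think this was fixed after best10CuratedDegree1_ce467ff, because the ESCA genes for that run look like the genes with the lowest baseMean values. I am not sure when ascending=False was added. Try a re-run. The listing below is from the newer best10CuratedDegree1 run, where the ESCA genes are ordered from highest to lowest baseMean.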

```
aedavids@mustard $ ls -ld best10CuratedDegree1 best10CuratedDegree1_ce467ff
drwxr-sr-x 3 aedavids kimlab 1 Jan 11 18:39 best10CuratedDegree1/
drwxr-sr-x 3 aedavids kimlab 1 Jan  9 10:44 best10CuratedDegree1_ce467ff/

/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best10CuratedDegree1/training/best10CuratedDegree1.sh.out/GTEx_TCGA-design-tilda_gender_category-padj-0001-lfc-20-n-10

 $ cut -d , -f 1,2,3 ESCA_vs_all.results 
name,baseMean,log2FoldChange
MCRIP1, 2870.33818751656,-2.03709482774914
IFFO1,1498.01418347192,-2.00338443019926
CAMK1,1058.83394372702,-2.06536653484754
ZNF667-AS1,860.038393899797,-2.81341430542468
KHDRBS3,787.155544345561,-2.07207508901871
DNALI1,690.194131887099,-2.16737852595459
ZNF471,606.988558770845,-2.16316671279139
C3orf18,601.702161492102,-2.44478529553007
FAM229B,575.05392665072,-2.17126693888998
GGTA1P,544.313205080983,-2.13903226293733

```
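
The effect of a missing ascending=False can be reproduced on a toy table. The sketch below uses made-up gene names and values (they are not taken from any run). It only illustrates that pandas sort_values sorts in ascending order by default, so taking the first n rows selects the genes with the lowest baseMean.

```python
import pandas as pd

# hypothetical values, for illustration only
toyDF = pd.DataFrame({
    "name": ["geneA", "geneB", "geneC", "geneD"],
    "baseMean": [50.0, 3000.0, 700.0, 10.0],
})

n = 2

# default is ascending=True: head(n) returns the LOWEST baseMean genes
buggyTop = toyDF.sort_values(by="baseMean").head(n)["name"].to_list()

# ascending=False: head(n) returns the HIGHEST baseMean genes
fixedTop = (
    toyDF.sort_values(by="baseMean", ascending=False).head(n)["name"].to_list()
)

print(f'buggyTop : {buggyTop}')
print(f'fixedTop : {fixedTop}')
```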



In [1]:
import ipynbname

# use display() to print an html version of a data frame
# useful if the data frame output is not generated by the last line of the cell
from IPython.display import display

import numpy as np
import os
import pandas as pd

import pprint as pp
import matplotlib.pyplot as plt

import sys

notebookName = ipynbname.name()
notebookPath = ipynbname.path()
notebookDir = os.path.dirname(notebookPath)

#outDir = f'{notebookDir}/{notebookName}.out'
outDir = f'/private/groups/kimlab/aedavids/elife/{notebookName}.out'
os.makedirs(outDir, exist_ok=True)
print(f'outDir:\n{outDir}')

# results of hyperparameter search
#hyperparameterOut = "/private/groups/kimlab/aedavids/elife/hyperparmeterTunning"

imgOut = f'{outDir}/img'
os.makedirs(imgOut, exist_ok=True)
print(f'\nimgOut :\n{imgOut}')

import logging
loglevel = "INFO"
#loglevel = "WARN"
# logFMT = "%(asctime)s %(levelname)s [thr:%(threadName)s %(name)s %(funcName)s() line:%(lineno)s] [%(message)s]"
logFMT = "%(asctime)s %(levelname)s %(name)s %(funcName)s() line:%(lineno)s] [%(message)s]"
logging.basicConfig(format=logFMT, level=loglevel)    
logger = logging.getLogger(notebookName)

meaningOfLife = 42


outDir:
/private/groups/kimlab/aedavids/elife/findCandidateEnrichmentBiomarkers.out

imgOut :
/private/groups/kimlab/aedavids/elife/findCandidateEnrichmentBiomarkers.out/img


In [2]:
# setting the python path allows us to run python scripts from
# the CLI.
ORIG_PYTHONPATH = os.environ['PYTHONPATH']

####### config deconvolutionModules
deconvolutionModules = notebookPath.parent.joinpath("../../../deconvolutionAnalysis/python/")
print("deconvolutionModules: {}\n".format(deconvolutionModules))

PYTHONPATH = ORIG_PYTHONPATH + f':{deconvolutionModules}'
print("PYTHONPATH: {}\n".format(PYTHONPATH))

##### config intraExtraRNA_POCModules
intraExtraRNA_POCModules=notebookPath.parent.joinpath("../../python/src")
print("intraExtraRNA_POCModules: {}\n".format(intraExtraRNA_POCModules))

PYTHONPATH = PYTHONPATH + f':{intraExtraRNA_POCModules}'
print("PYTHONPATH: {}\n".format(PYTHONPATH))

###### set new PYTHONPATH
os.environ["PYTHONPATH"] = PYTHONPATH
PYTHONPATH = os.environ["PYTHONPATH"]
print("PYTHONPATH: {}\n".format(PYTHONPATH))

###### set sys.path
# to be able to import our local python files we need to set the sys.path
# https://stackoverflow.com/a/50155834
sys.path.append( str(deconvolutionModules) )
sys.path.append( str(intraExtraRNA_POCModules) )
print("\nsys.path:\n{}\n".format(sys.path))

deconvolutionModules: /private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../../deconvolutionAnalysis/python

PYTHONPATH: :/private/home/aedavids/extraCellularRNA/src:/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../../deconvolutionAnalysis/python

intraExtraRNA_POCModules: /private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python/src

PYTHONPATH: :/private/home/aedavids/extraCellularRNA/src:/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../../deconvolutionAnalysis/python:/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python/src

PYTHONPATH: :/private/home/aedavids/extraCellularRNA/src:/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../../deconvolutionA

In [3]:
# import local 
from analysis.bestSignatureGeneConfig import BestSignatureGeneConfig
from analysis.utilities import findElementsInIntersectionsWithDegree
from analysis.utilities import findIntersectionsWithDegree
from analysis.utilities import loadDictionary

## Check for unused degree 1 genes

In [4]:
className = "ESCA"
# runName = "best10CuratedDegree1_ce467ff"
runName = "best10CuratedDegree1"

print(f'# Load {runName}  {className} biomarkers')

runRoot = "/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category"
runOut = f'{runRoot}/{runName}/training/best10CuratedDegree1.sh.out'
deseqOut = f'{runOut}/GTEx_TCGA-design-tilda_gender_category-padj-0001-lfc-20-n-10'
runPath = f'{deseqOut}/{className}_vs_all.results'
print(f'runPath :\n{runPath}')

runDF = pd.read_csv(runPath).sort_values(by="baseMean", ascending=False)
runDF

# Load best10CuratedDegree1  ESCA biomarkers
runPath :
/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best10CuratedDegree1/training/best10CuratedDegree1.sh.out/GTEx_TCGA-design-tilda_gender_category-padj-0001-lfc-20-n-10/ESCA_vs_all.results


,name,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
0,MCRIP1,2870.338188,-2.037095,0.07937,-25.665794,2.817612e-145,2.459381e-143
1,IFFO1,1498.014183,-2.003384,0.098171,-20.406996,1.449221e-92,6.291511e-91
2,CAMK1,1058.833944,-2.065367,0.107973,-19.128501,1.462191e-81,5.113821e-80
3,ZNF667-AS1,860.038394,-2.813414,0.130148,-21.617029,1.242211e-103,6.578132e-102
4,KHDRBS3,787.155544,-2.072075,0.139028,-14.904036,3.1025839999999997e-50,5.026478e-49
5,DNALI1,690.194132,-2.167379,0.156332,-13.863962,1.047308e-43,1.396829e-42
6,ZNF471,606.988559,-2.163167,0.12175,-17.767274,1.266994e-70,3.5032810000000003e-69
7,C3orf18,601.702161,-2.444785,0.119278,-20.4966,2.308739e-93,1.0216729999999999e-91
8,FAM229B,575.053927,-2.171267,0.142517,-15.235105,2.06795e-52,3.6009079999999996e-51
9,GGTA1P,544.313205,-2.139032,0.15688,-13.634813,2.4862889999999998e-42,3.169326e-41


In [5]:
biomarkers = runDF.loc[:,"name"].to_list()
biomarkerSet = set(biomarkers)

In [6]:
# load upstream upset plot intersection dictionary
upstreamRunName = "best500FindAllDegree1_wl500"
upstreamRoot = f'/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/{upstreamRunName}' 
upstreamOut = f'{upstreamRoot}/training/{upstreamRunName}.sh.out'
upsetPlotOut=f'{upstreamOut}/upsetPlot.out'

upstreamPath = f'{upsetPlotOut}/best500_findAllDegree1_wl500.intersection.dict'
print(f'upstreamPath :\n{upstreamPath}')
upstreamPathIntersectionDict = loadDictionary( upstreamPath)

upstreamPath :
/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best500FindAllDegree1_wl500/training/best500FindAllDegree1_wl500.sh.out/upsetPlot.out/best500_findAllDegree1_wl500.intersection.dict


In [7]:
degree1Dict = findIntersectionsWithDegree(
                    upstreamPathIntersectionDict, 
                    degree=1)
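
findIntersectionsWithDegree lives in analysis.utilities and its source is not shown in this notebook. As a rough mental model only, and assuming the degree of an intersection is the number of classes in its key, it may behave like the hypothetical filterByDegree below. The toy dictionary borrows keys and genes that appear elsewhere in this notebook; 'geneX' and 'geneY' are placeholders.

```python
def filterByDegree(intersectionDict: dict, degree: int) -> dict:
    # hypothetical stand-in, NOT the analysis.utilities implementation
    # assumption: degree == number of classes in the key tuple
    return {
        key: genes
        for key, genes in intersectionDict.items()
        if len(key) == degree
    }


toyDict = {
    ('ESCA',): ['geneX', 'geneY'],
    ('ESCA', 'PRAD'): ['PLPP1'],
    ('ESCA', 'UCS'): ['MEOX2'],
    ('Liver', 'PRAD', 'UVM'): ['ABCC11'],
}

d1 = filterByDegree(toyDict, degree=1)
d2 = filterByDegree(toyDict, degree=2)
print(f'degree 1 keys : {list(d1.keys())}')
print(f'degree 2 keys : {list(d2.keys())}')
```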

In [8]:
def findBiomarkers(
    intersectionDict : dict[ tuple[str, ...], list[str] ], 
    setName: str) -> dict[ tuple[str, ...], list[str] ]:
    '''
    arguments
        intersectionDict
            key: tuple of GTEx or TCGA classes
                example:  ('Liver', 'PRAD', 'UVM')
                
            values: list of biomarkers
                example : ['ABCC11'],
        setName :
            a GTEx or TCGA class
                example : 'Liver'

    returns
        a dictionary with only the items whose key contains setName
    '''

    retDict = dict()
    for key,values in intersectionDict.items():
        if setName in key:
            retDict[key] = values

    return retDict
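
Because the keys are tuples, the test setName in key is an exact match on class names, not a substring match. The small check below illustrates this with a toy dictionary. The helper selectKeysWith is a hypothetical one-line equivalent of the loop above, and 'geneX' is a placeholder.

```python
def selectKeysWith(intersectionDict: dict, setName: str) -> dict:
    # hypothetical helper: same tuple membership test findBiomarkers uses
    return {k: v for k, v in intersectionDict.items() if setName in k}


toyDict = {
    ('ESCA', 'Esophagus_Mucosa'): ['TMPRSS11BNL', 'KRT24', '(TGGCCC)n'],
    ('Esophagus_Mucosa',): ['geneX'],
    ('ESCA', 'UCS'): ['MEOX2'],
}

escaDict = selectKeysWith(toyDict, 'ESCA')

# 'Esophagus' is a substring of 'Esophagus_Mucosa' but not an element
# of any key tuple, so nothing is selected
esophagusDict = selectKeysWith(toyDict, 'Esophagus')

print(f'escaDict keys : {list(escaDict.keys())}')
print(f'esophagusDict : {esophagusDict}')
```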

In [9]:
def viewDict( intersectionDict ) :
    for k,v in intersectionDict.items():    
        if len(v) < 5:
            print(f'{str(k)} : {v}')
        else :
            print(f'{k} : len(v) = {len(v)}')

In [10]:
print(f'className : {className}')
classD1Dict= findBiomarkers(degree1Dict, setName=className)

viewDict( classD1Dict )

className : ESCA
('ESCA',) : len(v) = 104


In [11]:
# load all upstream biomarkers for className
upstreamClassResultsPath = f'{upstreamOut}/GTEx_TCGA-design-tilda_gender_category-padj-0001-lfc-20-n-500/{className}_vs_all.results'
upstreamClassResultsDF = pd.read_csv(upstreamClassResultsPath, index_col="name").sort_values(by="baseMean", ascending=False)
# i is the rank of the gene after sorting by baseMean in descending order
upstreamClassResultsDF['i'] = np.arange(upstreamClassResultsDF.shape[0])
print(f'{className} upstreamClassResultsDF.shape : {upstreamClassResultsDF.shape}')
upstreamClassResultsDF.head()

ESCA upstreamClassResultsDF.shape : (583, 7)


name,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,i
CLU,56070.78888,-3.172978,0.177939,-17.831859,3.999023e-71,1.12155e-69,0
GPX3,40969.907575,-3.095799,0.194927,-15.881841,8.465629999999999e-57,1.6297920000000001e-55,1
MALAT1,33662.063256,3.41139,0.111355,30.635216,4.159132e-206,7.230045e-204,2
FHL1,28068.059244,-2.172641,0.191624,-11.338056,8.50116e-30,7.125927e-29,3
APOE,22484.798309,-3.139022,0.208721,-15.039344,4.0554799999999995e-51,6.756236e-50,4


In [12]:
# find all degree 1 genes for class
key = ( className, )
print(f'key : {key}')
# print(degree1Dict.keys())
allD1DFGenes = degree1Dict[key]
print(f'len(allD1DFGenes) : {len(allD1DFGenes)}')

allD1DF = upstreamClassResultsDF.loc[ allD1DFGenes, :].sort_values(by="baseMean", ascending=False)
print(f'allD1DF.shape : {allD1DF.shape}')
allD1DF

key : ('ESCA',)
len(allD1DFGenes) : 104
allD1DF.shape : (104, 7)


name,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,i
MCRIP1,2870.338188,-2.037095,0.079370,-25.665794,2.817612e-145,2.459381e-143,85
IFFO1,1498.014183,-2.003384,0.098171,-20.406996,1.449221e-92,6.291511e-91,171
CAMK1,1058.833944,-2.065367,0.107973,-19.128501,1.462191e-81,5.113821e-80,232
ZNF667-AS1,860.038394,-2.813414,0.130148,-21.617029,1.242211e-103,6.578132e-102,292
KHDRBS3,787.155544,-2.072075,0.139028,-14.904036,3.102584e-50,5.026478e-49,318
...,...,...,...,...,...,...,...
PRELID1P1,38.467183,-3.566584,,,,5.328272e-202,578
HERVFH19-int,37.909259,2.465980,,,,3.311269e-92,579
UBE2SP2,37.234810,-4.912241,,,,1.487127e-68,580
(TA)n,37.185061,3.689817,,,,3.809100e-261,581


## <span style="color:red;background-color:yellow">Unused upstream degree 1 genes</span>

In [13]:
# remove any biomarkers we have already used
selectRows = ~ allD1DF.index.isin( biomarkerSet )
candidateD1DF = allD1DF.loc[ selectRows, :].sort_values(by="baseMean", ascending=False)
print(f'{className} candidateD1DF.shape : {candidateD1DF.shape}')
candidateD1DF.head(n=10)

ESCA candidateD1DF.shape : (94, 7)


name,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,i
HOXB13,542.321069,-2.489782,0.445972,-5.582818,2.366523e-08,6.906451e-08,408
VSIG10L,528.069988,2.227956,0.189023,11.786699,4.571068e-32,4.134433e-31,413
MER33,523.170063,2.2542,0.070298,32.066308,1.3008409999999999e-225,2.516119e-223,416
FAM83B,495.695843,2.084443,0.317496,6.565254,5.194428e-11,1.812221e-10,430
HSD17B14,490.322725,-2.783174,0.145223,-19.164801,7.283435000000001e-82,2.555965e-80,434
TMPRSS11E,463.462702,2.125285,0.412238,5.155485,2.529759e-07,6.858618e-07,442
HLA-J,463.379825,-2.03664,0.146011,-13.948532,3.211464e-44,4.338004e-43,443
PRSS27,461.448241,2.517073,0.226073,11.133903,8.579525e-29,6.936182e-28,445
SDSL,441.427211,-2.202559,0.137257,-16.04701,5.998517e-58,1.193302e-56,451
AC027290.2,398.577205,2.010041,0.121174,16.588108,8.495541000000001e-62,1.873703e-60,481
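
The selection above relies on Index.isin plus boolean negation. A minimal sketch on a three-row frame follows. The MCRIP1 and HOXB13 baseMean values are copied from the tables above and rounded; 'geneX' and its value are placeholders.

```python
import pandas as pd

allDF = pd.DataFrame(
    {"baseMean": [2870.3, 542.3, 100.0]},
    index=pd.Index(["MCRIP1", "HOXB13", "geneX"], name="name"),
)

# genes already used as biomarkers
usedSet = {"MCRIP1"}

# isin() returns a boolean array; ~ flips it so that
# True marks the genes that have NOT been used yet
selectRows = ~allDF.index.isin(usedSet)
candidateDF = allDF.loc[selectRows, :]

print(candidateDF)
```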


## Find Candidate ESCA Degree 2 Genes
Avoid genes that ESCA shares with Esophagus_Mucosa or STAD.

In [14]:
degree2Dict = findIntersectionsWithDegree(
                    upstreamPathIntersectionDict, 
                    degree=2)

In [15]:
ESCAIntersection_D2_Dict = findBiomarkers(degree2Dict, "ESCA")
viewDict( ESCAIntersection_D2_Dict )

('BLCA', 'ESCA') : ['PNMA8B', 'SLC16A4', 'NALCN']
('CESC', 'ESCA') : ['RTL5', 'HSPB2']
('CHOL', 'ESCA') : ['PLPP7']
('COAD', 'ESCA') : ['MAGEH1']
('ESCA', 'GBM') : ['RPL22P1']
('ESCA', 'KIRP') : ['SAMD9']
('ESCA', 'LUSC') : ['SLC13A3']
('ESCA', 'OV') : ['(CCAT)n']
('ESCA', 'PRAD') : ['PLPP1']
('ESCA', 'STAD') : len(v) = 28
('ESCA', 'TGCT') : ['LINC01002']
('ESCA', 'UCEC') : ['SPOCK3', 'GPIHBP1', 'FAT3']
('ESCA', 'UCS') : ['MEOX2']


## Compare Differential Expression Values

In [16]:
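def findCandidateBiomarkers(
    intersectionDict : dict[ tuple[str, ...], list[str] ],
    ignore : list[str],
    ) -> list[str]:
    '''
    arguments
        intersectionDict
            key: tuple of GTEx or TCGA classes
                example:  ('ESCA', 'PRAD')
            values: list of biomarkers shared by the classes in the key
        ignore
            list of classes. Intersections that include any of these
            classes are skipped

    returns
        a list of the biomarkers from all intersections that were not skipped
    '''
    retList = []
    ignoreSet = set(ignore)
    # t is a tuple of set names
    for t,v in intersectionDict.items():
        setName = set(t)
        if len( setName.intersection(ignoreSet) ) == 0:
            print(f'adding biomarkers from {setName}')
            retList = retList + v

    return retList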

allESCA_sharedGenes = findCandidateBiomarkers( ESCAIntersection_D2_Dict, ignore=['Esophagus_Mucosa', 'STAD'])
allESCA_sharedGenes

adding biomarkers from {'ESCA', 'BLCA'}
adding biomarkers from {'ESCA', 'CESC'}
adding biomarkers from {'ESCA', 'CHOL'}
adding biomarkers from {'COAD', 'ESCA'}
adding biomarkers from {'ESCA', 'GBM'}
adding biomarkers from {'KIRP', 'ESCA'}
adding biomarkers from {'ESCA', 'LUSC'}
adding biomarkers from {'OV', 'ESCA'}
adding biomarkers from {'ESCA', 'PRAD'}
adding biomarkers from {'ESCA', 'TGCT'}
adding biomarkers from {'ESCA', 'UCEC'}
adding biomarkers from {'ESCA', 'UCS'}


['PNMA8B',
 'SLC16A4',
 'NALCN',
 'RTL5',
 'HSPB2',
 'PLPP7',
 'MAGEH1',
 'RPL22P1',
 'SAMD9',
 'SLC13A3',
 '(CCAT)n',
 'PLPP1',
 'LINC01002',
 'SPOCK3',
 'GPIHBP1',
 'FAT3',
 'MEOX2']

In [17]:
# it is faster to load these results; the results files are only 500 lines long
deseqResultsDir= f'{upstreamOut}/GTEx_TCGA-design-tilda_gender_category-padj-0001-lfc-20-n-500'
print(f'{deseqResultsDir}')

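def loadResultsAndSelect(path : str, names : list[str], index_col : str="name") -> pd.DataFrame:
    # load a DESeq results file and select the rows for the requested genes
    # bug fix: index_col was ignored, "name" was hard coded
    df = pd.read_csv(path, index_col=index_col)
    #print(f'geneNames : {names}')
    # .loc raises a KeyError if any of the names is missing from the file
    resultsDF = df.loc[names, :]
    resultsDF = resultsDF.reset_index()    

    return resultsDF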
    
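def compareSharedGeneExpression(
    intersectionDict : dict[ tuple[str, ...], list[str] ], 
    rootSetName : str, 
    sharedGenes : list[str],
    ignore : list[str] = None
    ) -> pd.DataFrame:
    '''
    For each gene shared between rootSetName and one other class, load the
    differential expression results from both classes so they can be compared.

    arguments
        intersectionDict
            key: tuple of GTEx or TCGA classes, example: ('ESCA', 'PRAD')
            values: list of genes shared by the classes in the key
        rootSetName
            the class of interest, example: 'ESCA'
        sharedGenes
            genes to select from the rootSetName results file
        ignore
            classes to skip

    returns
        a data frame with one row per gene and class. The 'source' column
        is the class the row was loaded from
    '''
    # avoid a mutable default argument
    ignoreSet = set(ignore) if ignore else set()

    # select all the shared genes from the root class results
    resultsPath = f'{deseqResultsDir}/{rootSetName}_vs_all.results'
    retDF = loadResultsAndSelect( resultsPath,  sharedGenes)
    retDF['source'] = rootSetName

    for keys,geneNames in intersectionDict.items():
        for k in keys:
            # the original code also added gene names to the set of visited
            # classes. Class names never match gene names, so the test
            # reduces to skipping the root class and the ignored classes
            if (k != rootSetName) and (k not in ignoreSet):
                resultsPath = f'{deseqResultsDir}/{k}_vs_all.results'
                resultsDF = loadResultsAndSelect( resultsPath, geneNames )
                resultsDF['source'] = k
                print(f'k : {k} geneNames : {geneNames} ')
                retDF = pd.concat([retDF, resultsDF])

    return retDF.reset_index()

resultsDF = compareSharedGeneExpression( ESCAIntersection_D2_Dict, 'ESCA', allESCA_sharedGenes, ignore=['Esophagus_Mucosa', 'STAD'])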
resultsDF.sort_values(by=["name", ])

#('ESCA', 'Esophagus_Mucosa') : ['TMPRSS11BNL', 'KRT24', '(TGGCCC)n']
# ('ESCA', 'STAD') : len(v) = 167

/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best500FindAllDegree1_wl500/training/best500FindAllDegree1_wl500.sh.out/GTEx_TCGA-design-tilda_gender_category-padj-0001-lfc-20-n-500
k : BLCA geneNames : ['PNMA8B', 'SLC16A4', 'NALCN'] 
k : CESC geneNames : ['RTL5', 'HSPB2'] 
k : CHOL geneNames : ['PLPP7'] 
k : COAD geneNames : ['MAGEH1'] 
k : GBM geneNames : ['RPL22P1'] 
k : KIRP geneNames : ['SAMD9'] 
k : LUSC geneNames : ['SLC13A3'] 
k : OV geneNames : ['(CCAT)n'] 
k : PRAD geneNames : ['PLPP1'] 
k : TGCT geneNames : ['LINC01002'] 
k : UCEC geneNames : ['SPOCK3', 'GPIHBP1', 'FAT3'] 
k : UCS geneNames : ['MEOX2'] 


,index,name,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,source
27,0,(CCAT)n,755.411436,-2.855094,0.108985,-26.197239,2.857008e-151,4.419514e-149,OV
10,10,(CCAT)n,755.411436,-2.155953,0.135462,-15.915561,4.9421649999999996e-57,9.613421e-56,ESCA
32,2,FAT3,382.707236,-2.172076,0.239019,-9.087445,1.013936e-19,1.076538e-18,UCEC
15,15,FAT3,382.707236,-2.273265,0.2322,-9.790113,1.241576e-22,7.809420000000001e-22,ESCA
31,1,GPIHBP1,575.63202,-2.786757,0.193298,-14.416862,4.053598e-47,2.79225e-45,UCEC
14,14,GPIHBP1,575.63202,-3.24488,0.187535,-17.302783,4.482072e-67,1.1363499999999999e-65,ESCA
4,4,HSPB2,703.213665,-3.130593,0.167985,-18.636189,1.634798e-77,5.2866580000000005e-76,ESCA
21,1,HSPB2,703.213665,-2.825345,0.133789,-21.117863,5.4505580000000005e-99,4.845313e-97,CESC
29,0,LINC01002,2323.659751,2.518855,0.215464,11.690392,1.427263e-31,1.668377e-30,TGCT
12,12,LINC01002,2323.659751,2.246173,0.18301,12.273488,1.2571779999999999e-34,1.242074e-33,ESCA


In [18]:
resultsDF.describe()

,index,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
count,34.0,34.0,34.0,34.0,34.0,34.0,34.0
mean,4.205882,902.905785,-1.668673,0.182184,-10.587314,1.896088e-10,1.717119e-09
std,5.238433,741.57682,1.86943,0.081784,13.611407,9.096264e-10,7.254297e-09
min,0.0,382.707236,-3.24488,0.073816,-26.371574,0.0,0.0
25%,0.0,487.218965,-2.665226,0.126957,-18.302838,4.2668349999999997e-78,1.394718e-76
50%,1.5,611.284233,-2.173472,0.174448,-14.348325,1.324759e-53,2.367984e-52
75%,7.75,877.161884,-2.02416,0.208954,-9.191792,7.311197e-31,6.416122e-30
max,16.0,3225.374676,3.238897,0.403603,39.522346,5.191503e-09,3.733464e-08


In [19]:
resultsDF.sort_values(by="baseMean", ascending=False)

,index,name,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj,source
28,0,PLPP1,3225.374676,2.917388,0.073816,39.522346,0.0,0.0,PRAD
11,11,PLPP1,3225.374676,-2.044375,0.12262,-16.672393,2.080946e-62,4.669548e-61,ESCA
29,0,LINC01002,2323.659751,2.518855,0.215464,11.690392,1.427263e-31,1.668377e-30,TGCT
12,12,LINC01002,2323.659751,2.246173,0.18301,12.273488,1.2571779999999999e-34,1.242074e-33,ESCA
24,0,RPL22P1,1196.350755,-2.699888,0.200524,-13.464141,2.542582e-41,5.525048e-40,GBM
7,7,RPL22P1,1196.350755,-2.432305,0.182654,-13.316471,1.856801e-40,2.232225e-39,ESCA
23,0,MAGEH1,897.382865,-2.008921,0.084681,-23.723275,2.074354e-124,2.21645e-122,COAD
6,6,MAGEH1,897.382865,-2.123894,0.101233,-20.980312,9.924011e-98,4.704941e-96,ESCA
8,8,SAMD9,877.161884,2.455999,0.154959,15.849343,1.420577e-56,2.7246910000000003e-55,ESCA
25,0,SAMD9,877.161884,-2.017421,0.125557,-16.067711,4.296612e-58,8.737936999999999e-57,KIRP
