# Hyperparameter Tuning Results 1
Andrew E. Davidson  
aedavids@ucsc.edu  
12/15/23

**<span style="color:red">see extraCellularRNA/deconvolutionAnalysis/doc/debugPipeLineStages.md</span>**  
It explains why performance dropped on 1/1/23.

In general, I think there is a better "best" algorithm.

**goal**  
- We want to find a set of features (genes) we can use to create plasma classifiers with good sensitivity and specificity.
- We only have a small number of samples, so we need to use as few features as possible.
- Use our 1 vs. all and deconvolution to find the best set of genes.

**1 vs. all, deconvolution parameter dimensions, axis, ...**  
Construct the gene signature matrix as follows:
- choose the number of best/top genes
  * --padjThreshold 0.001 --lfcThreshold 2.0 sorted by base mean
- remove genes in intersections with degree > 10
  * these are probably biologically interesting, however they are not good discriminators
- enrich degree 1 genes
  * ensure all types/categories/classes have at least n = 3
- create the signature matrix from degree 1 genes

**pipeline flow**
best N -> remove degree > 10 -> enrich min 3 -> degree 1 genes
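
The flow above can be sketched on toy data. This is a minimal illustration, not the pipeline's implementation: the helper names (`build_intersections`, `remove_high_degree`, `degree1_genes`), the gene names, and the toy threshold of 2 are all hypothetical. It assumes an intersection dictionary keyed by a tuple of categories, which matches keys such as `('LUAD',)` printed by this notebook. The enrichment step is omitted here.

```python
from collections import defaultdict


def build_intersections(top_genes):
    """Map each sorted tuple of categories to the genes found in exactly those categories."""
    categories_by_gene = defaultdict(set)
    for category, genes in top_genes.items():
        for gene in genes:
            categories_by_gene[gene].add(category)

    intersections = defaultdict(list)
    for gene, categories in categories_by_gene.items():
        intersections[tuple(sorted(categories))].append(gene)
    return {key: sorted(genes) for key, genes in intersections.items()}


def remove_high_degree(intersections, max_degree=10):
    """Drop genes shared by more than max_degree categories."""
    return {k: v for k, v in intersections.items() if len(k) <= max_degree}


def degree1_genes(intersections):
    """Keep only the genes that belong to a single category."""
    return {k[0]: v for k, v in intersections.items() if len(k) == 1}


# hypothetical top-N gene lists for three categories
top_genes = {
    "LUAD": ["geneA", "geneB", "geneX"],
    "LUSC": ["geneC", "geneB", "geneX"],
    "Lung": ["geneD", "geneX"],
}

intersections = build_intersections(top_genes)
filtered = remove_high_degree(intersections, max_degree=2)  # toy threshold
unique = degree1_genes(filtered)
print(unique)
```

In this toy run `geneX` is shared by all three categories and is removed by the degree filter, `geneB` survives the filter but is not unique, and each category keeps one degree 1 gene.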


**data overview**
For each run of the pipeline we construct a row in a dataframe. The row contains a test metric for each category, along with the test metric mean, standard deviation, number of genes, number of categories with test metric > threshold, ...

Example:
```
bestDF.iloc[:, [0, 1, 2, 83, 84, 85]]

id	ACC	Adipose_Subcutaneous Adipose_Visceral_Omentum	mean_sensitivity std_sensitivity numGenes
best20GTEx_TCGA	0.562	0.950	0.637	0.704458	0.215157	257
best25GTEx_TCGA	0.562	0.942	0.671	0.709084	0.215128	321
best30GTEx_TCGA	0.542	0.952	0.698	0.711289	0.214588	381
best50GTEx_TCGA	0.562	0.960	0.760	0.722566	0.207179	596
best100GTEx_TCGA 0.562	0.972	0.803	0.734277	0.205296	1148
```
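
As a rough sketch of how one such row could be assembled from per-category values: `summarize_run` is a hypothetical helper, not the project's `metricsRunner`, the category names and values are made up, and the use of the sample standard deviation is an assumption.

```python
import statistics


def summarize_run(run_id, metric_by_category, num_genes, threshold=0.7):
    """Build one summary row (a dict) from the per-category metric values."""
    values = list(metric_by_category.values())
    row = {"id": run_id}
    row.update(metric_by_category)
    row["mean_sensitivity"] = statistics.mean(values)
    # sample standard deviation; whether the pipeline uses the sample or
    # the population standard deviation is an assumption here
    row["std_sensitivity"] = statistics.stdev(values)
    row["numGenes"] = num_genes
    # number of categories whose metric is strictly above the threshold
    row["numAboveThreshold"] = sum(v > threshold for v in values)
    return row


# hypothetical run with three categories
row = summarize_run("toyRun", {"typeA": 0.9, "typeB": 0.5, "typeC": 0.7}, num_genes=12)
print(row)
```

Collecting one such dict per run and passing the list to `pd.DataFrame` would give a table with the same general shape as the example.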

In [1]:
import ast
import ipynbname

# use display() to print an HTML version of a data frame
# useful if the dataFrame output is not generated by the last line of the cell
from IPython.display import display

import numpy as np
import pandas as pd
# display all columns
pd.set_option('display.max_columns', None)

import pathlib as pl
import os
import sys

In [2]:
# setting PYTHONPATH allows us to run our python scripts from
# the CLI. Use get() so the cell does not fail if PYTHONPATH is not set
ORIG_PYTHONPATH = os.environ.get('PYTHONPATH', '')

pp = ipynbname.path()
deconvolutionModules = pp.parent.joinpath("../../python")
print("deconvolutionModules: {}\n".format(deconvolutionModules))

PYTHONPATH = ORIG_PYTHONPATH + f':{deconvolutionModules}'
print("PYTHONPATH: {}\n".format(PYTHONPATH))

os.environ["PYTHONPATH"] = PYTHONPATH
PYTHONPATH = os.environ["PYTHONPATH"]
print("PYTHONPATH: {}\n".format(PYTHONPATH))

deconvolutionModules: /private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python

PYTHONPATH: :/private/home/aedavids/extraCellularRNA/src:/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python

PYTHONPATH: :/private/home/aedavids/extraCellularRNA/src:/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python



In [3]:
# to be able to import our local python files we need to set the sys.path
# https://stackoverflow.com/a/50155834
sys.path.append( str(deconvolutionModules) )
print("\nsys.path:\n{}\n".format(sys.path))


sys.path:
['/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning', '/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning', '/private/home/aedavids/extraCellularRNA/src', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python311.zip', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11/lib-dynload', '', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11/site-packages', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11/site-packages/setuptools-57.4.0-py3.9.egg', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11/site-packages/pip-21.1.3-py3.9.egg', '/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python']



In [4]:
from analysis.hyperParameterTunningMetrics import metricsRunner, elifeCols, lungCols
from analysis.hyperParameterTunningMetrics import findSummaryMetricsCols
from analysis.utilities import findIntersectionsWithDegree
from analysis.utilities import loadDictionary

In [5]:
root = "/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category"
notebookName = ipynbname.name()
outDir = f'{root}/hyperParameter/{notebookName}.out'
print( f'output dir: \n{outDir}' )
os.makedirs(outDir, exist_ok=True)

output dir: 
/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out


### Best 1vsAll axis
selected top N genes 

In [6]:
bestResultsDirs = [ 
              "best20GTEx_TCGA",
              "best25GTEx_TCGA", 
              "best30GTEx_TCGA", 
              "best50GTEx_TCGA",
              "best100GTEx_TCGA",     
              "best200GTEx_TCGA",     
              # "best500GTEx_TCGA",     
]


outFilePrefix = "best"
bestDF, BestBellowThresholdDF = metricsRunner(root, outDir, outFilePrefix, bestResultsDirs, 
                       metric='sensitivity', threshold=0.7, verbose=False)

bestDF.loc[:, findSummaryMetricsCols('sensitivity') + elifeCols  ]


saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/best.sensitivity.0.7.csv

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/best.sensitivity.bellow.0.7.csv


id,mean_sensitivity,std_sensitivity,numGenes,numTypes,numDegree1,numAboveThreshold,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best20GTEx_TCGA,0.704458,0.216465,258,83,38,45,0.447,0.495,0.437,0.821,0.189,0.713,0.276,0.982
best25GTEx_TCGA,0.709084,0.216436,322,83,44,43,0.44,0.508,0.456,0.804,0.18,0.717,0.271,0.985
best30GTEx_TCGA,0.711289,0.215893,382,83,45,47,0.469,0.505,0.475,0.786,0.135,0.78,0.271,0.987
best50GTEx_TCGA,0.722566,0.208438,597,83,49,49,0.469,0.571,0.532,0.786,0.171,0.789,0.244,0.987
best100GTEx_TCGA,0.734277,0.206544,1149,83,62,50,0.528,0.654,0.57,0.732,0.117,0.789,0.204,0.987
best200GTEx_TCGA,0.75112,0.197365,2217,83,73,53,0.605,0.708,0.563,0.714,0.135,0.812,0.244,0.982


## Explore best200 and best500
Does this provide supporting evidence that the algorithm designed on 1/2/24 is worth pursuing?

In [7]:
def exploreDegree1Interesections(intesectionPath : str):
    intersectionDict = loadDictionary(intesectionPath)
    degree1Dict = findIntersectionsWithDegree(intersectionDict, degree=1)
    totalUniqueGenes = 0
    for category,genes in degree1Dict.items():
        nUnique = len(genes)
        totalUniqueGenes += nUnique
        print( f'category: {category} : num genes : {nUnique}' )
    
    print( f'\n totalUniqueGenes : {totalUniqueGenes}' )

    return (intersectionDict, degree1Dict, totalUniqueGenes)
 

In [8]:
# use best20 as a baseline
def exploreBest20():
    best20DictPathDictPath = "/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best20GTEx_TCGA/training/best20GTEx_TCGA.sh.out/upsetPlot.out/best20.intersection.dict"
    intersectionDict, degree1Dict, totalUniqueGenes = exploreDegree1Interesections(best20DictPathDictPath)

exploreBest20()

category: ('Adipose_Visceral_Omentum',) : num genes : 1
category: ('Artery_Coronary',) : num genes : 2
category: ('BLCA',) : num genes : 4
category: ('BRCA',) : num genes : 3
category: ('Bladder',) : num genes : 6
category: ('Breast_Mammary_Tissue',) : num genes : 1
category: ('CHOL',) : num genes : 2
category: ('Cervix_Endocervix',) : num genes : 6
category: ('DLBC',) : num genes : 3
category: ('ESCA',) : num genes : 1
category: ('Esophagus_Gastroesophageal_Junction',) : num genes : 1
category: ('GBM',) : num genes : 3
category: ('Heart_Atrial_Appendage',) : num genes : 3
category: ('Kidney_Cortex',) : num genes : 1
category: ('LIHC',) : num genes : 1
category: ('LUAD',) : num genes : 2
category: ('LUSC',) : num genes : 1
category: ('Lung',) : num genes : 1
category: ('MESO',) : num genes : 2
category: ('Minor_Salivary_Gland',) : num genes : 1
category: ('Muscle_Skeletal',) : num genes : 3
category: ('OV',) : num genes : 1
category: ('PAAD',) : num genes : 7
category: ('PCPG',) : num 

In [9]:
def exploreBest200():
    best200DictPathDictPath = "/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best200GTEx_TCGA/training/best200GTEx_TCGA.sh.out/upsetPlot.out/best200.intersection.dict"
    intersectionDict, degree1Dict, totalUniqueGenes = exploreDegree1Interesections(best200DictPathDictPath)
    return (intersectionDict, degree1Dict, totalUniqueGenes)
    
best200IntersectionDict, best200Degree1Dict, best200TotalUniqueGenes = exploreBest200()

category: ('ACC',) : num genes : 7
category: ('Adipose_Subcutaneous',) : num genes : 1
category: ('Adipose_Visceral_Omentum',) : num genes : 2
category: ('Adrenal_Gland',) : num genes : 4
category: ('Artery_Aorta',) : num genes : 1
category: ('Artery_Coronary',) : num genes : 4
category: ('Artery_Tibial',) : num genes : 3
category: ('BLCA',) : num genes : 13
category: ('BRCA',) : num genes : 15
category: ('Bladder',) : num genes : 32
category: ('Brain_Cerebellar_Hemisphere',) : num genes : 2
category: ('Brain_Cerebellum',) : num genes : 1
category: ('Brain_Cortex',) : num genes : 1
category: ('Brain_Frontal_Cortex_BA9',) : num genes : 2
category: ('Brain_Spinal_cord_cervical_c-1',) : num genes : 1
category: ('CESC',) : num genes : 10
category: ('CHOL',) : num genes : 20
category: ('COAD',) : num genes : 4
category: ('Cells_Cultured_fibroblasts',) : num genes : 5
category: ('Cells_EBV-transformed_lymphocytes',) : num genes : 1
category: ('Cervix_Endocervix',) : num genes : 96
category: 

### We found 22 unique LUAD and 19 unique LUSC genes
Best 200 returns at most 200 DESeq results rows. These rows meet our biological significance criteria. Overall sensitivity is not great. Keep in mind best200 has 2059 genes, and many of these are poor discriminators, i.e. they are in intersections with a high degree.
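
One way to see how many genes sit in high-degree intersections is to count genes by degree. The sketch below is a hypothetical helper (`genes_per_degree`) run on made-up data. It assumes the intersection dictionary maps a tuple of categories to a list of genes, as the printed keys in this notebook suggest.

```python
from collections import Counter


def genes_per_degree(intersection_dict):
    """Count genes by the degree (number of categories) of their intersection."""
    counts = Counter()
    for categories, genes in intersection_dict.items():
        counts[len(categories)] += len(genes)
    return dict(sorted(counts.items()))


# hypothetical intersection dictionary
toy = {
    ("LUAD",): ["g1", "g2"],
    ("LUSC",): ["g3"],
    ("LUAD", "LUSC"): ["g4", "g5", "g6"],
    ("LUAD", "LUSC", "Lung"): ["g7"],
}
print(genes_per_degree(toy))
```

Under the same assumption about its keys, the function could be applied to the dictionary returned by `loadDictionary`.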

In [10]:
best2001vsAllResultsDir = "/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best200GTEx_TCGA/training/best200GTEx_TCGA.sh.out/GTEx_TCGA-design-tilda_gender_category-padj-0001-lfc-20-n-200"
LUADPath = f'{best2001vsAllResultsDir}/LUAD_vs_all.results'
LUADResultsDF = pd.read_csv(LUADPath)

print(f'LUADResultsDF.shape : {LUADResultsDF.shape}')
# LUADResultsDF.head()
best200Genes = best200Degree1Dict[('LUAD',)]
selectRows = LUADResultsDF.loc[:, "name"].isin(best200Genes)
LUADResultsDF.loc[selectRows, :]

LUADResultsDF.shape : (200, 7)


Unnamed: 0,name,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
96,CPM,2031.120136,2.243723,0.091136,24.619449,7.820893999999999e-134,1.617437e-131
103,NAPSA,1863.020676,6.808249,0.160603,42.391826,0.0,0.0
113,SEMA6A,1659.715353,-2.506262,0.085304,-29.380225,9.826587e-190,6.130019e-187
129,HMGB3,1429.246162,2.026619,0.073351,27.629207,4.961875e-168,2.195514e-165
139,CRLF1,1287.980457,3.044996,0.116143,26.217546,1.67669e-151,5.023864e-149
149,SCN7A,1214.557916,2.007483,0.139291,14.412177,4.338181e-47,9.710635e-46
156,ADCY2,1143.054913,-3.372049,0.141744,-23.789673,4.2719690000000003e-125,7.322577e-123
160,DRAM1,1101.891449,2.000329,0.069934,28.603167,6.1361550000000005e-180,3.28872e-177
161,KRT23,1100.808127,-2.130086,0.190668,-11.171684,5.610699e-29,5.753272e-28
162,AC006115.2,1088.317661,-3.549901,0.154433,-22.986742,6.326273e-117,9.084288999999999e-115


In [11]:
LUSCPath = f'{best2001vsAllResultsDir}/LUSC_vs_all.results'
LUSCResultsDF = pd.read_csv(LUSCPath)

print(f'LUSCResultsDF.shape : {LUSCResultsDF.shape}')
best200Genes = best200Degree1Dict[('LUSC',)]
selectRows = LUSCResultsDF.loc[:, "name"].isin(best200Genes)
LUSCResultsDF.loc[selectRows, :]

LUSCResultsDF.shape : (200, 7)


Unnamed: 0,name,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
89,SRXN1,2153.961341,2.21605,0.065354,33.908658,4.965815e-252,5.671102e-249
100,ACKR3,1956.341452,2.100492,0.087331,24.052191,7.918539000000001e-128,1.030983e-125
101,GCLC,1919.390917,2.091534,0.056606,36.948753,7.627089000000001e-299,2.54052e-295
131,RNF157,1485.068661,-2.17476,0.117775,-18.465401,3.921031e-76,1.47162e-74
147,RBPMS2,1295.946716,-2.145042,0.118793,-18.057046,6.944102e-73,2.3866100000000002e-71
154,QPRT,1255.637483,-2.049667,0.106324,-19.277617,8.280377e-83,3.677499e-81
158,REEP6,1224.74737,-2.717801,0.12355,-21.997622,3.034756e-107,2.511434e-105
163,NEGR1,1178.589947,-2.041367,0.102591,-19.89813,4.223964e-88,2.175723e-86
164,RORC,1176.09748,-2.607522,0.143041,-18.229147,3.030089e-74,1.078501e-72
173,ME1,1133.241584,2.036512,0.077647,26.227782,1.281489e-151,2.799039e-149


## best Removed
Remove genes that are shared between many classes/types/categories from the best results. They may be biologically interesting,
however they are probably not good discriminators.

In [12]:
bestRemoved_10_ResultsDirs = [ 
    "best20RemovedGTEx_TCGA", 
    "best25RemovedGTEx_TCGA", 
    "best30RemovedGTEx_TCGA", 
    "best50RemovedGTEx_TCGA",
    "best100RemovedGTEx_TCGA",     
#     "best200RemovedGTEx_TCGA",
#     "best500RemovedGTEx_TCGA",         
]

outFilePrefix = "bestRemoved_10"
bestRemoved_10_DF, bestRemoved_10BellowThresholdDF= metricsRunner(root, outDir, outFilePrefix, bestRemoved_10_ResultsDirs, 
                       metric='sensitivity', threshold=0.7)

bestRemoved_10_DF.loc[:, findSummaryMetricsCols('sensitivity') + elifeCols ]


saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestRemoved_10.sensitivity.0.7.csv

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestRemoved_10.sensitivity.bellow.0.7.csv


id,mean_sensitivity,std_sensitivity,numGenes,numTypes,numDegree1,numAboveThreshold,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best20RemovedGTEx_TCGA,0.714072,0.216875,213,83,38,46,0.482,0.661,0.392,0.732,0.207,0.677,0.342,0.958
best25RemovedGTEx_TCGA,0.718542,0.21189,263,83,44,45,0.502,0.678,0.443,0.696,0.198,0.682,0.329,0.969
best30RemovedGTEx_TCGA,0.725518,0.210816,309,83,45,48,0.521,0.651,0.5,0.714,0.171,0.731,0.342,0.976
best50RemovedGTEx_TCGA,0.729458,0.206537,476,83,49,47,0.55,0.674,0.513,0.732,0.144,0.722,0.333,0.976
best100RemovedGTEx_TCGA,0.733554,0.206084,906,83,62,48,0.57,0.728,0.589,0.696,0.135,0.744,0.262,0.982


In [13]:
# TODO aedwip check run scripts. are we using the correct intersection dict?
# bestRemoved_5_ResultsDirs = [ 
#     "best20Removed_5_GTEx_TCGA", 
#     "best25Removed_5_GTEx_TCGA", 
#     "best30Removed_5_GTEx_TCGA", 
#     "best50Removed_5_GTEx_TCGA",
#     "best100Removed_5_GTEx_TCGA",     
#     # "best200Removed_5_GTEx_TCGA",
#     # "best500Removed_5_GTEx_TCGA",         
# ]

# outFilePrefix = "bestRemoved_5"
# bestRemoved_5_DF= metricsRunner(root, outDir, outFilePrefix, bestRemoved_5_ResultsDirs, 
#                        metric='sensitivity', threshold=0.7)

# bestRemoved_5_DF.loc[:, findSummaryMetricsCols('sensitivity') + elifeCols ]

## best enriched
Starting with the bestRemovedResultsDirs, make sure all categories/types/classes have at least 3 unique genes.
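
The enrichment idea can be sketched as follows. This is only an illustration under stated assumptions, not the pipeline's algorithm: `enrich_degree1` and its inputs are hypothetical, candidates are assumed to be ranked best first, and a candidate is accepted only if no category has already claimed it.

```python
def enrich_degree1(degree1, ranked_candidates, min_genes=3):
    """Top up each category to at least min_genes unique genes where possible.

    degree1: category -> list of genes unique to that category
    ranked_candidates: category -> candidate genes ordered best first
    """
    used = {gene for genes in degree1.values() for gene in genes}
    enriched = {category: list(genes) for category, genes in degree1.items()}
    for category, candidates in ranked_candidates.items():
        genes = enriched.setdefault(category, [])
        for gene in candidates:
            if len(genes) >= min_genes:
                break
            if gene not in used:  # keep every selected gene unique to one category
                genes.append(gene)
                used.add(gene)
    return enriched


# hypothetical inputs
degree1 = {"ESCA": ["e1"], "STAD": ["s1", "s2", "s3"]}
ranked = {"ESCA": ["s1", "e2", "e3", "e4"], "STAD": ["s4"], "ACC": ["a1", "a2"]}
enriched = enrich_degree1(degree1, ranked, min_genes=3)
print(enriched)
```

In this sketch a category can still end up below the minimum when its candidate list runs out, as the toy `ACC` entry does.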

In [14]:
bestEnriched_3_ResultsDirs = [ 
    "best20EnrichedGTEx_TCGA",
    "best25EnrichedGTEx_TCGA",
    "best30EnrichedGTEx_TCGA",
    "best50EnrichedGTEx_TCGA",
    "best100EnrichedGTEx_TCGA",    
]

outFilePrefix = "bestEnriched_3"
bestEnriched_3_DF, bestEnriched_3BellowThresholdDF = metricsRunner(root, outDir, outFilePrefix, bestEnriched_3_ResultsDirs, 
                       metric='sensitivity', threshold=0.7, verbose=True)

bestEnriched_3_DF.loc[:, findSummaryMetricsCols('sensitivity') + elifeCols ]

path : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best20EnrichedGTEx_TCGA

load best20EnrichedGTEx_TCGA :
/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best20EnrichedGTEx_TCGA/training/best20EnrichedGTEx_TCGA.sh.out/metrics/metricsRounded.csv

load
/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best20EnrichedGTEx_TCGA/training/best20EnrichedGTEx_TCGA.sh.out/upsetPlot.out/best20_degreeThreshold_10_enrich_3.intersection.dict

best20EnrichedGTEx_TCGA types without degree 1 intersections: 
 {'Brain_Amygdala', 'Adrenal_Gland', 'Spleen', 'ACC', 'Heart_Left_Ventricle'}
path : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best25EnrichedGTEx_TCGA

load best25EnrichedGTEx_TCGA :
/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/best25EnrichedGTEx_TCGA/training/best25EnrichedGTEx_TCGA.sh.out/metrics/metricsRounded.csv

load
/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gen

id,mean_sensitivity,std_sensitivity,numGenes,numTypes,numDegree1,numAboveThreshold,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best20EnrichedGTEx_TCGA,0.712554,0.213826,403,83,78,47,0.479,0.528,0.481,0.732,0.171,0.767,0.267,0.987
best25EnrichedGTEx_TCGA,0.710386,0.213231,435,83,77,46,0.472,0.532,0.481,0.732,0.189,0.749,0.258,0.987
best30EnrichedGTEx_TCGA,0.710735,0.215248,481,83,76,48,0.466,0.532,0.487,0.732,0.162,0.776,0.213,0.985
best50EnrichedGTEx_TCGA,0.71953,0.208035,659,83,70,48,0.479,0.565,0.506,0.786,0.18,0.794,0.244,0.987
best100EnrichedGTEx_TCGA,0.733096,0.207526,1165,83,69,50,0.531,0.651,0.563,0.75,0.117,0.789,0.204,0.987


In [15]:
# TODO aedwip debug: why do some classes not have degree 1 sets?


## best enriched
Starting with the bestRemovedResultsDirs, make sure all categories/types/classes have at least 6 unique genes.

In [16]:
bestEnriched_6_ResultsDirs = [ 
    "best20Enriched_6_GTEx_TCGA",
    "best25Enriched_6_GTEx_TCGA",
    "best30Enriched_6_GTEx_TCGA",
    "best50Enriched_6_GTEx_TCGA",
    "best100Enriched_6_GTEx_TCGA",    
]

outFilePrefix = "bestEnriched_6"
bestEnriched_6DF, bestEnriched_6BellowThresholdDF = metricsRunner(root, outDir, outFilePrefix, bestEnriched_6_ResultsDirs, 
                       metric='sensitivity', threshold=0.7)

bestEnriched_6DF.loc[:, findSummaryMetricsCols('sensitivity') + elifeCols ]


saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestEnriched_6.sensitivity.0.7.csv

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestEnriched_6.sensitivity.bellow.0.7.csv


id,mean_sensitivity,std_sensitivity,numGenes,numTypes,numDegree1,numAboveThreshold,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best20Enriched_6_GTEx_TCGA,0.715229,0.21114,636,83,83,47,0.485,0.575,0.494,0.732,0.162,0.758,0.244,0.982
best25Enriched_6_GTEx_TCGA,0.713807,0.211382,664,83,82,46,0.482,0.591,0.532,0.732,0.153,0.767,0.231,0.982
best30Enriched_6_GTEx_TCGA,0.714795,0.211291,696,83,81,47,0.485,0.585,0.532,0.732,0.153,0.762,0.236,0.982
best50Enriched_6_GTEx_TCGA,0.723855,0.207689,854,83,78,50,0.511,0.625,0.551,0.714,0.144,0.758,0.213,0.985
best100Enriched_6_GTEx_TCGA,0.736096,0.206509,1257,83,76,50,0.54,0.664,0.576,0.768,0.117,0.794,0.204,0.987


## Best Unique Genes
Only use genes from intersections of degree 1, enriched with a minimum of 3 genes.

In [17]:
bestUniqueResults = [
       "best20EnrichedDegree1GTEx_TCGA",
        "best25EnrichedDegree1GTEx_TCGA",
        "best30EnrichedDegree1GTEx_TCGA",
        "best50EnrichedDegree1GTEx_TCGA",
        "best100EnrichedDegree1GTEx_TCGA",
    ]

outFilePrefix = "bestUnique"
bestUniqueDF, bestUniqueBellowThresholdDF = metricsRunner(root, outDir, outFilePrefix, bestUniqueResults, 
                       metric='sensitivity', threshold=0.7)

bestUniqueDF.loc[:, findSummaryMetricsCols('sensitivity') + elifeCols  ]


saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestUnique.sensitivity.0.7.csv

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestUnique.sensitivity.bellow.0.7.csv


id,mean_sensitivity,std_sensitivity,numGenes,numTypes,numDegree1,numAboveThreshold,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best20EnrichedDegree1GTEx_TCGA,0.762687,0.219785,234,83,78,54,0.379,0.621,0.424,0.75,0.279,0.601,0.116,0.974
best25EnrichedDegree1GTEx_TCGA,0.760289,0.214121,232,83,77,55,0.46,0.595,0.399,0.714,0.261,0.789,0.151,0.976
best30EnrichedDegree1GTEx_TCGA,0.751542,0.21242,243,83,76,53,0.372,0.561,0.367,0.768,0.279,0.74,0.142,0.96
best50EnrichedDegree1GTEx_TCGA,0.741205,0.217547,260,83,70,50,0.424,0.502,0.557,0.714,0.369,0.744,0.298,0.987
best100EnrichedDegree1GTEx_TCGA,0.782542,0.205068,411,83,69,60,0.602,0.661,0.5,0.696,0.288,0.771,0.276,0.985


## Best Unique Genes
Only use genes from intersections of degree 1, enriched with a minimum of 6 genes.

In [18]:
bestUnique_6_Results = [
       "best20Enriched_6_Degree1GTEx_TCGA",
        "best25Enriched_6_Degree1GTEx_TCGA",
        "best30Enriched_6_Degree1GTEx_TCGA",
        "best50Enriched_6_Degree1GTEx_TCGA",
        "best100Enriched_6_Degree1GTEx_TCGA",
    ]

outFilePrefix = "bestUnique_6"
bestUnique_6_DF, bestUnique_6BellowThresholdDF = metricsRunner(root, outDir, outFilePrefix, bestUnique_6_Results, 
                       metric='sensitivity', threshold=0.7)

bestUnique_6_DF.loc[:, findSummaryMetricsCols('sensitivity') + elifeCols  ]


saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestUnique_6.sensitivity.0.7.csv

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestUnique_6.sensitivity.bellow.0.7.csv


id,mean_sensitivity,std_sensitivity,numGenes,numTypes,numDegree1,numAboveThreshold,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best20Enriched_6_Degree1GTEx_TCGA,0.790976,0.194126,467,83,83,60,0.54,0.645,0.544,0.732,0.252,0.677,0.182,0.96
best25Enriched_6_Degree1GTEx_TCGA,0.790614,0.192834,461,83,82,59,0.599,0.668,0.563,0.679,0.252,0.731,0.231,0.96
best30Enriched_6_Degree1GTEx_TCGA,0.788277,0.191369,458,83,81,60,0.621,0.664,0.519,0.768,0.252,0.735,0.218,0.951
best50Enriched_6_Degree1GTEx_TCGA,0.783024,0.188576,455,83,78,58,0.583,0.645,0.614,0.696,0.279,0.686,0.262,0.978
best100Enriched_6_Degree1GTEx_TCGA,0.798024,0.187475,503,83,76,62,0.673,0.704,0.551,0.786,0.306,0.807,0.316,0.978


In [19]:
def poorSensitivityAEDWIP():
    '''
    TODO parameterize
    '''
    metricCols = ['mean_sensitivity', 'std_sensitivity', 'numGenes', 'numTypes', 
                  'numDegree1', 'numAboveThreshold']
    
    selectSensitivityCols = ~bestUnique_6_DF.columns.isin( metricCols )
    sensitivityColsIndex = bestUnique_6_DF.columns[selectSensitivityCols]
    assert len(sensitivityColsIndex) == 83
    
    sensitivitySeries = bestUnique_6_DF.loc['best100Enriched_6_Degree1GTEx_TCGA', sensitivityColsIndex] 
    
    defaultThreshold = 0.7
    selectRows =  sensitivitySeries < defaultThreshold

    poorSensitivitySeries = sensitivitySeries.loc[selectRows] 
    
    return poorSensitivitySeries

bestUnique_6_PoorSensitivitySeries = poorSensitivityAEDWIP()
bestUnique_6_PoorSensitivitySeries

id
BLCA                                    0.689
BRCA                                    0.679
Brain_Amygdala                          0.692
Brain_Anterior_cingulate_cortex_BA24    0.453
Brain_Caudate_basal_ganglia             0.635
Brain_Frontal_Cortex_BA9                0.690
Brain_Hippocampus                       0.517
Brain_Putamen_basal_ganglia             0.553
Brain_Substantia_nigra                  0.548
Breast_Mammary_Tissue                   0.609
COAD                                    0.551
Colon_Transverse                        0.572
ESCA                                    0.306
Esophagus_Gastroesophageal_Junction     0.600
Esophagus_Muscularis                    0.693
LUAD                                    0.673
PAAD                                    0.439
SARC                                    0.077
STAD                                    0.316
Small_Intestine_Terminal_Ileum          0.598
Vagina                                  0.553
Name: best100Enriched_6_Degree1

## check specificity

In [20]:
outFilePrefix = "bestUnique_6_Specificity"
bestUnique_6_SpecificityDF, bestUnique_6_SpecificityBellowThresholdDF = metricsRunner(root, outDir, outFilePrefix, bestUnique_6_Results, 
                       metric='specificity', threshold=0.7)

bestUnique_6_SpecificityDF.loc[:, findSummaryMetricsCols('specificity') + elifeCols  ]


saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestUnique_6_Specificity.specificity.0.7.csv

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/bestUnique_6_Specificity.specificity.bellow.0.7.csv


id,mean_specificity,std_specificity,numGenes,numTypes,numDegree1,numAboveThreshold,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best20Enriched_6_Degree1GTEx_TCGA,0.997506,0.002773,467,83,83,83,1.0,0.996,0.993,0.992,1.0,1.0,1.0,1.0
best25Enriched_6_Degree1GTEx_TCGA,0.997518,0.002843,461,83,82,83,1.0,0.996,0.993,0.992,1.0,0.999,1.0,1.0
best30Enriched_6_Degree1GTEx_TCGA,0.997554,0.002777,458,83,81,83,1.0,0.996,0.995,0.992,1.0,0.999,0.999,1.0
best50Enriched_6_Degree1GTEx_TCGA,0.99741,0.003037,455,83,78,83,1.0,0.996,0.991,0.993,1.0,1.0,0.999,1.0
best100Enriched_6_Degree1GTEx_TCGA,0.997711,0.002667,503,83,76,83,1.0,0.993,0.995,0.991,0.999,1.0,0.999,1.0


In [21]:
def poorSpecificityAEDWIP(threshold):
    '''
    TODO parameterize
    '''
    metricCols = ['mean_specificity', 'std_specificity', 'numGenes', 'numTypes', 
                  'numDegree1', 'numAboveThreshold']
    
    specificityCols = ~bestUnique_6_SpecificityDF.columns.isin( metricCols )
    specificitySeries = bestUnique_6_SpecificityDF.loc['best100Enriched_6_Degree1GTEx_TCGA', specificityCols] 
 
    selectRows =  specificitySeries < threshold

    poorSpecificitySeries = specificitySeries.loc[selectRows] 
    
    return poorSpecificitySeries

threshold = 0.99
bestUnique_6_PoorSpecificitySeries = poorSpecificityAEDWIP(threshold=threshold)
print(f' categories with specificity < {threshold}' )
bestUnique_6_PoorSpecificitySeries

 categories with specificity < 0.99


id
Colon_Sigmoid    0.985
Name: best100Enriched_6_Degree1GTEx_TCGA, dtype: float64
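
For reference, the sketch below computes one-vs-rest sensitivity, TP / (TP + FN), and specificity, TN / (TN + FP), for a single category. The function name and the labels are hypothetical. With many categories, the negatives for any one category are all samples of the other categories, so a few false positives barely lower specificity even when sensitivity is poor.

```python
def sensitivity_specificity(y_true, y_pred, category):
    """One-vs-rest sensitivity and specificity for a single category."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == category:
            if pred == category:
                tp += 1
            else:
                fn += 1
        else:
            if pred == category:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# hypothetical labels: 4 ESCA, 4 STAD, and 12 Lung samples
y_true = ["ESCA"] * 4 + ["STAD"] * 4 + ["Lung"] * 12
y_pred = ["ESCA", "STAD", "STAD", "STAD"] + ["STAD", "STAD", "STAD", "ESCA"] + ["Lung"] * 12

sens, spec = sensitivity_specificity(y_true, y_pred, "ESCA")
print(f"ESCA sensitivity: {sens:.3f} specificity: {spec:.3f}")
```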

## Evaluate Elife performance on training data

In [22]:
# sensitivity
display(bestUnique_6_DF.loc['best100Enriched_6_Degree1GTEx_TCGA', elifeCols])


bestUnique_6_SpecificityDF.loc['best100Enriched_6_Degree1GTEx_TCGA', elifeCols]

id
LUAD           0.673
LUSC           0.704
COAD           0.551
READ           0.786
ESCA           0.306
LIHC           0.807
STAD           0.316
Whole_Blood    0.978
Name: best100Enriched_6_Degree1GTEx_TCGA, dtype: float64

id
LUAD           1.000
LUSC           0.993
COAD           0.995
READ           0.991
ESCA           0.999
LIHC           1.000
STAD           0.999
Whole_Blood    1.000
Name: best100Enriched_6_Degree1GTEx_TCGA, dtype: float64

In [23]:
def evalMetrics(cols : list[str],
                sensitivityDF : pd.DataFrame,
                specificityDF : pd.DataFrame,
                runName : str,
               ) -> pd.DataFrame :
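    '''
    Combine the sensitivity and specificity of a single run into one data frame.

    arguments:
        cols:
            categories to report, example ["COAD", "READ", "Colon_Sigmoid", ]
        sensitivityDF, specificityDF:
            data frames returned by metricsRunner(), indexed by run id
        runName:
            the row (run id) to select from both data frames

    returns:
        data frame with one row per category and the columns "sensitivity" and "specificity"
    '''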

    # sensitivitySeries = bestUnique_6_DF.loc['best100Enriched_6_Degree1GTEx_TCGA', cols]
    sensitivitySeries = sensitivityDF.loc[runName, cols]
    sensitivitySeries.name = "sensitivity"
    
    # specificitySeries = bestUnique_6_SpecificityDF.loc['best100Enriched_6_Degree1GTEx_TCGA', cols]
    specificitySeries = specificityDF.loc[runName, cols]
    specificitySeries.name = "specificity"
    
    byCols=1
    retDF = pd.concat( [sensitivitySeries, specificitySeries], axis=byCols).sort_values(by="id")

    return retDF

In [24]:
bestUnique_6_elifeMetricsDF = evalMetrics( elifeCols, 
                            sensitivityDF=bestUnique_6_DF,  
                            specificityDF=bestUnique_6_SpecificityDF,
                            runName='best100Enriched_6_Degree1GTEx_TCGA',
                                         )
print(f'elife types best 100 enriched by six degree  1 (i.e. at least 6 unique genes for each type')
bestUnique_6_elifeMetricsDF

elife types best 100 enriched by six degree  1 (i.e. at least 6 unique genes for each type


id,sensitivity,specificity
COAD,0.551,0.995
ESCA,0.306,0.999
LIHC,0.807,1.0
LUAD,0.673,1.0
LUSC,0.704,0.993
READ,0.786,0.991
STAD,0.316,0.999
Whole_Blood,0.978,1.0


## Enrich LUAD and LUSC
<span style="color:red">We added 3, 6, 9, or 12 genes to each. This does not improve the sensitivity or specificity of LUAD and LUSC.</span>

Hypotheses:
- We did not assign "best" genes in an optimized way.
- LUAD and LUSC are very close. 1vsAll does not find fine-grained differences. In their 1vsAll results, both are differentially expressed in similar ways with respect to all the remaining classes.
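
The second hypothesis can be illustrated with a toy calculation. Everything in this sketch is made up for illustration: the expression values, the `tissue0` to `tissue7` categories, and the helper names. It also uses a plain log2 ratio of group means, which is a simplification and not the model DESeq fits. A gene that is high in both LUAD and LUSC and low elsewhere gets a large 1vsAll fold change for each of them, yet shows no difference between the two.

```python
import math


def log2_fold_change(group, reference):
    """log2 ratio of the mean expression in group to the mean in reference."""
    mean_group = sum(group) / len(group)
    mean_reference = sum(reference) / len(reference)
    return math.log2(mean_group / mean_reference)


# hypothetical expression of one gene, three samples per category
expression = {"LUAD": [800, 820, 780], "LUSC": [760, 800, 840]}
expression.update({f"tissue{i}": [50, 45, 55] for i in range(8)})


def one_vs_all(category):
    """Fold change of one category against the pooled samples of all others."""
    rest = [v for c, values in expression.items() if c != category for v in values]
    return log2_fold_change(expression[category], rest)


luad_vs_all = one_vs_all("LUAD")
lusc_vs_all = one_vs_all("LUSC")
luad_vs_lusc = log2_fold_change(expression["LUAD"], expression["LUSC"])

print(f"LUAD vs all  : {luad_vs_all:.3f}")
print(f"LUSC vs all  : {lusc_vs_all:.3f}")
print(f"LUAD vs LUSC : {luad_vs_lusc:.3f}")
```

In this toy case both 1vsAll values exceed a log fold change of 2.0 while the direct LUAD vs. LUSC comparison is zero, so the gene would pass a 1vsAll filter for both classes without helping to tell them apart.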

In [26]:
# TODO aedwip check and re-run
enrichLUAD_LUSC_results = [
    "best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_3",
    "best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_6",
    "best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_9",
    "best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_12",
]

outFilePrefix = "enrichLUAD_LUSC_results"
LUAD_LUSC_3_sensitivityDF, LUAD_LUSC_3_sensitivityBellowThresholdDF = metricsRunner(root, outDir, outFilePrefix, enrichLUAD_LUSC_results, 
                       metric='sensitivity', threshold=0.7)

LUAD_LUSC_3_sensitivityDF.loc[:, findSummaryMetricsCols('sensitivity') + elifeCols ]


saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/enrichLUAD_LUSC_results.sensitivity.0.7.csv

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/enrichLUAD_LUSC_results.sensitivity.bellow.0.7.csv


id,mean_sensitivity,std_sensitivity,numGenes,numTypes,numDegree1,numAboveThreshold,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_3,0.812614,0.178801,643,83,83,68,0.706,0.744,0.563,0.732,0.288,0.807,0.342,0.971
best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_6,0.814518,0.177712,649,83,83,68,0.712,0.738,0.576,0.75,0.288,0.825,0.342,0.971
best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_9,0.814422,0.177152,655,83,83,68,0.718,0.741,0.576,0.714,0.288,0.825,0.347,0.971
best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_12,0.813771,0.176957,661,83,83,67,0.689,0.744,0.576,0.714,0.279,0.825,0.356,0.976


In [28]:
outFilePrefix = "enrichLUAD_LUSC_results_Specificity"
LUAD_LUSC_3_specificityDF, LUAD_LUSC_3_specificityBellowThreshold = metricsRunner(root, outDir, outFilePrefix, enrichLUAD_LUSC_results, 
                       metric='specificity', threshold=0.7)
LUAD_LUSC_3_specificityDF.loc[:, elifeCols ]


saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/enrichLUAD_LUSC_results_Specificity.specificity.0.7.csv

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/enrichLUAD_LUSC_results_Specificity.specificity.bellow.0.7.csv


id,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_3,1.0,0.994,0.994,0.99,1.0,1.0,0.999,1.0
best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_6,1.0,0.995,0.994,0.989,1.0,1.0,0.999,1.0
best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_9,1.0,0.995,0.994,0.989,1.0,1.0,0.999,1.0
best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_12,1.0,0.994,0.993,0.989,1.0,1.0,0.999,1.0


In [29]:
# sensitivity and specificity of the eLife types for run LUAD_LUSC_3

runName='best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_3'
LUAD_LUSC_3_elifeMetricsDF = evalMetrics( elifeCols, 
                            sensitivityDF=LUAD_LUSC_3_sensitivityDF,  
                            specificityDF=LUAD_LUSC_3_specificityDF,
                            runName=runName,
)
print(f'runName : {runName}')
print(f'elife types best 100 enriched by six degree 1 (i.e. at least 6 unique genes for each type)')
print(f'added 3 genes to LUAD and LUSC')
LUAD_LUSC_3_elifeMetricsDF

runName : best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_3
elife types best 100 enriched by six degree 1 (i.e. at least 6 unique genes for each type)
added 3 genes to LUAD and LUSC


id,sensitivity,specificity
COAD,0.563,0.994
ESCA,0.288,1.0
LIHC,0.807,1.0
LUAD,0.706,1.0
LUSC,0.744,0.994
READ,0.732,0.99
STAD,0.342,0.999
Whole_Blood,0.971,1.0
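
The table above puts one run's sensitivity and specificity side by side, one row per type. A minimal sketch of that reshaping follows, assuming both data frames are indexed by run `id`. `evalMetricsSketch` is a hypothetical stand-in and may differ from the real `evalMetrics()`.

```python
import pandas as pd

elifeCols = ["LUAD", "LUSC", "COAD", "READ", "ESCA", "LIHC", "STAD", "Whole_Blood"]
runName = "best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_3"

# one row per run; values copied from the sensitivity and specificity tables above
sensitivityDF = pd.DataFrame(
    [[0.706, 0.744, 0.563, 0.732, 0.288, 0.807, 0.342, 0.971]],
    index=[runName], columns=elifeCols)
specificityDF = pd.DataFrame(
    [[1.0, 0.994, 0.994, 0.99, 1.0, 1.0, 0.999, 1.0]],
    index=[runName], columns=elifeCols)

def evalMetricsSketch(cols, sensitivityDF, specificityDF, runName):
    """hypothetical stand-in for evalMetrics():
    returns one row per type and one column per metric for a single run"""
    retDF = pd.DataFrame({
        "sensitivity": sensitivityDF.loc[runName, cols],
        "specificity": specificityDF.loc[runName, cols],
    })
    # sort by type name so runs are easy to compare by eye
    return retDF.sort_index()

sketchDF = evalMetricsSketch(elifeCols, sensitivityDF, specificityDF, runName)
print(sketchDF)
```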


In [30]:
# sensitivity and specificity of the eLife types for run LUAD_LUSC_12

# outFilePrefix = "enrichLUAD_LUSC_results_Specificity"
# LUAD_LUSC_3_specificityDF = metricsRunner(root, outDir, outFilePrefix, enrichLUAD_LUSC_results, 
#                        metric='specificity', threshold=0.7)
# LUAD_LUSC_3_specificityDF.iloc[:, colSampleIdx ]

runName='best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_12'
LUAD_LUSC_3_elifeMetricsDF_12 = evalMetrics( elifeCols, 
                            sensitivityDF=LUAD_LUSC_3_sensitivityDF,  
                            specificityDF=LUAD_LUSC_3_specificityDF,
                            runName=runName,
)

print(f'runName : {runName}')
print(f'elife types best 100 enriched by six degree 1 (i.e. at least 6 unique genes for each type)')
print(f'added 12 genes to LUAD and LUSC')
LUAD_LUSC_3_elifeMetricsDF_12

runName : best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_12
elife types best 100 enriched by six degree 1 (i.e. at least 6 unique genes for each type)
added 12 genes to LUAD and LUSC


id,sensitivity,specificity
COAD,0.576,0.993
ESCA,0.279,1.0
LIHC,0.825,1.0
LUAD,0.689,1.0
LUSC,0.744,0.994
READ,0.714,0.989
STAD,0.356,0.999
Whole_Blood,0.976,1.0


# Summary

TODO: check and re-run `enrichLUAD_LUSC_results`

In [31]:
summaryResultsDirs = [
        "best100GTEx_TCGA",
        "best100RemovedGTEx_TCGA",     
        "best100EnrichedGTEx_TCGA",    
        "best100Enriched_6_GTEx_TCGA",    
        "best100EnrichedDegree1GTEx_TCGA",
        "best100Enriched_6_Degree1GTEx_TCGA",
]

outFilePrefix = "summary"
summaryDF, summaryBellowThresholdDF = metricsRunner(root, outDir, outFilePrefix, summaryResultsDirs, 
                       metric='sensitivity', threshold=0.7)

summaryDF.loc[:, findSummaryMetricsCols('sensitivity') + elifeCols  ]


saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/summary.sensitivity.0.7.csv

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults1.out/summary.sensitivity.bellow.0.7.csv


id,mean_sensitivity,std_sensitivity,numGenes,numTypes,numDegree1,numAboveThreshold,LUAD,LUSC,COAD,READ,ESCA,LIHC,STAD,Whole_Blood
best100GTEx_TCGA,0.734277,0.206544,1149,83,62,50,0.528,0.654,0.57,0.732,0.117,0.789,0.204,0.987
best100RemovedGTEx_TCGA,0.733554,0.206084,906,83,62,48,0.57,0.728,0.589,0.696,0.135,0.744,0.262,0.982
best100EnrichedGTEx_TCGA,0.733096,0.207526,1165,83,69,50,0.531,0.651,0.563,0.75,0.117,0.789,0.204,0.987
best100Enriched_6_GTEx_TCGA,0.736096,0.206509,1257,83,76,50,0.54,0.664,0.576,0.768,0.117,0.794,0.204,0.987
best100EnrichedDegree1GTEx_TCGA,0.782542,0.205068,411,83,69,60,0.602,0.661,0.5,0.696,0.288,0.771,0.276,0.985
best100Enriched_6_Degree1GTEx_TCGA,0.798024,0.187475,503,83,76,62,0.673,0.704,0.551,0.786,0.306,0.807,0.316,0.978
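
One way to read the summary table is relative to the `best100GTEx_TCGA` baseline. The sketch below does that using only the `mean_sensitivity` and `numGenes` values copied from the table. The column names `deltaMeanSensitivity` and `geneFraction` are made up for illustration.

```python
import pandas as pd

# mean_sensitivity and numGenes copied from the summary table above
summarySketchDF = pd.DataFrame(
    {
        "mean_sensitivity": [0.734277, 0.733554, 0.733096, 0.736096, 0.782542, 0.798024],
        "numGenes": [1149, 906, 1165, 1257, 411, 503],
    },
    index=[
        "best100GTEx_TCGA",
        "best100RemovedGTEx_TCGA",
        "best100EnrichedGTEx_TCGA",
        "best100Enriched_6_GTEx_TCGA",
        "best100EnrichedDegree1GTEx_TCGA",
        "best100Enriched_6_Degree1GTEx_TCGA",
    ],
)

baseline = summarySketchDF.loc["best100GTEx_TCGA"]

# change in mean sensitivity and fraction of genes kept, relative to the baseline run
summarySketchDF["deltaMeanSensitivity"] = (
    summarySketchDF["mean_sensitivity"] - baseline["mean_sensitivity"]
)
summarySketchDF["geneFraction"] = summarySketchDF["numGenes"] / baseline["numGenes"]

bestRun = summarySketchDF["mean_sensitivity"].idxmax()
print(summarySketchDF.round(3))
print(f'best mean sensitivity : {bestRun}')
```

In this table the two degree 1 runs are the only ones that improve mean sensitivity noticeably, and they do so with fewer than half of the baseline genes.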
