# Hyperparameter Tuning Results 2
Andrew E. Davidson  
aedavids@ucsc.edu  
12/15/23

**goal**  
- We want to find a set of features (genes) we can use to create plasma classifiers with good sensitivity and specificity.
- We only have a small number of samples, so we need to use as few features as possible.
- Use our 1 vs. all analysis and deconvolution to find the best set of genes.

**1 vs. all, deconvolution parameter dimensions, axes, ...**  
Construct the gene signature matrix as follows:
- number of best/top genes (N)
  * `--padjThreshold 0.001 --lfcThreshold 2.0`, sorted by base mean
- remove genes in intersections with degree > 10
  * these are probably biologically interesting, but they are not good discriminators
- enrich degree 1 genes
  * ensure every type/category/class has at least n = 3 unique (degree 1) genes
- create the signature matrix from the degree 1 genes

**pipeline flow**  
best N -> remove degree > 10 -> enrich min 3 -> degree 1 genes
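
The flow above can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline code: the function names, the DESeq-style column names (`gene`, `baseMean`, `log2FoldChange`, `padj`), the direction of the fold change filter, and the toy values are all assumptions. It also assumes that the "degree" of a gene is the number of categories whose candidate list contains it. The enrichment step is omitted.

```python
import pandas as pd


def select_best(deseq_df, n, padj_threshold=0.001, lfc_threshold=2.0):
    """Top-n gene names that pass both thresholds, ranked by baseMean."""
    passed = deseq_df[(deseq_df["padj"] < padj_threshold)
                      & (deseq_df["log2FoldChange"] >= lfc_threshold)]
    ranked = passed.sort_values("baseMean", ascending=False)
    return ranked["gene"].head(n).tolist()


def gene_degree(candidates):
    """Number of categories whose candidate list contains each gene."""
    degree = {}
    for genes in candidates.values():
        for gene in set(genes):
            degree[gene] = degree.get(gene, 0) + 1
    return degree


def keep_max_degree(candidates, max_degree):
    """Drop genes shared by more than max_degree categories."""
    degree = gene_degree(candidates)
    return {cat: [g for g in genes if degree[g] <= max_degree]
            for cat, genes in candidates.items()}


# toy DESeq-like results; the values are made up for illustration
toy = {
    "Lung": pd.DataFrame({"gene": ["A", "B", "C"],
                          "baseMean": [100.0, 50.0, 10.0],
                          "log2FoldChange": [3.0, 2.5, 1.0],
                          "padj": [1e-5, 1e-4, 1e-6]}),
    "Liver": pd.DataFrame({"gene": ["B", "D"],
                           "baseMean": [80.0, 20.0],
                           "log2FoldChange": [2.2, 4.0],
                           "padj": [1e-4, 1e-8]}),
}

best = {cat: select_best(df, n=2) for cat, df in toy.items()}   # best N
no_hubs = keep_max_degree(best, max_degree=10)                  # remove degree > 10
signature_genes = keep_max_degree(no_hubs, max_degree=1)        # degree 1 genes
print(signature_genes)
```

In the toy data, gene `B` passes the thresholds in both categories, so it has degree 2 and is excluded from the degree 1 signature.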


**data overview**  
For each run of the pipeline we construct a row in a dataframe. The row contains a test metric for each category, along with the test metric's mean and standard deviation, the number of genes, the number of categories with test metric > threshold, ...

Example:
```
bestDF.iloc[:, [0, 1, 2, 83, 84, 85]]

id                  ACC  Adipose_Subcutaneous  Adipose_Visceral_Omentum  mean_sensitivity  std_sensitivity  numGenes
best20GTEx_TCGA   0.562                 0.950                     0.637          0.704458         0.215157       257
best25GTEx_TCGA   0.562                 0.942                     0.671          0.709084         0.215128       321
best30GTEx_TCGA   0.542                 0.952                     0.698          0.711289         0.214588       381
best50GTEx_TCGA   0.562                 0.960                     0.760          0.722566         0.207179       596
best100GTEx_TCGA  0.562                 0.972                     0.803          0.734277         0.205296      1148
```
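
As a rough illustration of how one such row could be assembled, here is a minimal sketch. `summarize_run` is a hypothetical helper, not the `metricsRunner` implementation, and the category values are made up. It also assumes the standard deviation is the pandas default (sample standard deviation, `ddof=1`), which may differ from what the real code uses.

```python
import pandas as pd


def summarize_run(run_id, per_category, num_genes, metric="sensitivity", threshold=0.7):
    """Build one summary row: per-category values plus summary statistics."""
    values = pd.Series(per_category, dtype=float)
    row = values.to_dict()
    row[f"mean_{metric}"] = values.mean()
    row[f"std_{metric}"] = values.std()  # pandas default: sample std (ddof=1)
    row["numGenes"] = num_genes
    # number of categories whose metric is strictly above the threshold
    row["numAboveThreshold"] = int((values > threshold).sum())
    return pd.Series(row, name=run_id)


# toy per-category sensitivities; the values are made up for illustration
rows = [
    summarize_run("runA", {"Lung": 0.9, "LUAD": 0.5, "LUSC": 0.7}, num_genes=12),
    summarize_run("runB", {"Lung": 0.8, "LUAD": 0.8, "LUSC": 0.8}, num_genes=30),
]
summaryDF = pd.DataFrame(rows)
summaryDF.index.name = "id"
print(summaryDF)
```

Keeping one row per run makes it easy to compare runs side by side with `.loc` or `.iloc`, as in the example above.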

In [1]:
#import ast
import ipynbname

# use display() to print an HTML version of a data frame
# useful when the data frame output is not generated by the last line of the cell
from IPython.display import display

import numpy as np
import pandas as pd
# display all columns
pd.set_option('display.max_columns', None)

# import pathlib as pl
import os
import sys

In [2]:
# setting PYTHONPATH allows us to run our python scripts from
# the CLI.
# .get() avoids a KeyError when PYTHONPATH is not already set
ORIG_PYTHONPATH = os.environ.get('PYTHONPATH', '')

pp = ipynbname.path()
deconvolutionModules = pp.parent.joinpath("../../python")
print("deconvolutionModules: {}\n".format(deconvolutionModules))

PYTHONPATH = ORIG_PYTHONPATH + f':{deconvolutionModules}'
print("PYTHONPATH: {}\n".format(PYTHONPATH))

os.environ["PYTHONPATH"] = PYTHONPATH
PYTHONPATH = os.environ["PYTHONPATH"]
print("PYTHONPATH: {}\n".format(PYTHONPATH))

deconvolutionModules: /private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python

PYTHONPATH: :/private/home/aedavids/extraCellularRNA/src:/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python

PYTHONPATH: :/private/home/aedavids/extraCellularRNA/src:/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python



In [3]:
# to be able to import our local python files we need to set the sys.path
# https://stackoverflow.com/a/50155834
sys.path.append( str(deconvolutionModules) )
print("\nsys.path:\n{}\n".format(sys.path))


sys.path:
['/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning', '/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning', '/private/home/aedavids/extraCellularRNA/src', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python311.zip', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11/lib-dynload', '', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11/site-packages', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11/site-packages/setuptools-57.4.0-py3.9.egg', '/private/home/aedavids/miniconda3/envs/extraCellularRNA/lib/python3.11/site-packages/pip-21.1.3-py3.9.egg', '/private/home/aedavids/extraCellularRNA/deconvolutionAnalysis/jupyterNotebooks/hyperParameterTunning/../../python']



In [4]:
from analysis.hyperParameterTunningMetrics import elifeCols, lungCols, findSummaryMetricsCols
from analysis.hyperParameterTunningMetrics import metricsRunner

In [5]:
root = "/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category"
notebookName = ipynbname.name()
outDir = f'{root}/hyperParameter/{notebookName}.out'
print( f'output dir: \n{outDir}' )
os.makedirs(outDir, exist_ok=True)

output dir: 
/private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults2.out


## Ranked
1. concat all DESeq results files
2. select rows that are biologically significant
3. sort by base mean
4. rows with the strongest biologically significant signal will be on top
5. iterate over the sorted list
   + when we find a "new" gene, assign it to the category with the highest base mean, i.e. signal strength

Assigning genes to categories based on signal strength may still produce poor discriminators. Genes discovered this way may be members of intersections with high degree.
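
The ranking steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code that produced the `rank*GTEx_TCGA` runs: `rank_assign`, the column names, and the toy values are hypothetical, and "biologically significant" is assumed to mean the same padj and log fold change thresholds used elsewhere in this notebook.

```python
import pandas as pd


def rank_assign(deseq_by_category, padj_threshold=0.001, lfc_threshold=2.0):
    """Assign each gene to the category where it has the highest baseMean."""
    # 1. concat all results, remembering which category each row came from
    frames = [df.assign(category=cat) for cat, df in deseq_by_category.items()]
    all_df = pd.concat(frames, ignore_index=True)

    # 2. keep rows assumed to be biologically significant
    significant = all_df[(all_df["padj"] < padj_threshold)
                         & (all_df["log2FoldChange"] >= lfc_threshold)]

    # 3. + 4. strongest signal on top
    ranked = significant.sort_values("baseMean", ascending=False)

    # 5. the first time a gene is seen, it is assigned to that row's category
    assignment = {}
    for gene, category in zip(ranked["gene"], ranked["category"]):
        if gene not in assignment:
            assignment[gene] = category
    return assignment


# toy DESeq-like results; the values are made up for illustration
toy_rank = {
    "Lung": pd.DataFrame({"gene": ["A", "B"],
                          "baseMean": [100.0, 50.0],
                          "log2FoldChange": [3.0, 2.5],
                          "padj": [1e-5, 1e-4]}),
    "Liver": pd.DataFrame({"gene": ["B", "D"],
                           "baseMean": [80.0, 20.0],
                           "log2FoldChange": [2.2, 4.0],
                           "padj": [1e-4, 1e-8]}),
}
assignment = rank_assign(toy_rank)
print(assignment)
```

Gene `B` is significant in both toy categories and goes to `Liver` because its base mean is higher there. It is still a shared (degree 2) gene, which is the weakness noted above.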


In [6]:
rankResultsDirs = [ 
             "rank5GTEx_TCGA",    
             "rank10GTEx_TCGA",    
             "rank15GTEx_TCGA",    
             "rank20GTEx_TCGA",    
]


outFilePrefix = "rank"
rankSensitivityDF = metricsRunner(root, outDir, outFilePrefix, rankResultsDirs, 
                       metric='sensitivity', threshold=0.7)

# print( "sensitivity summary + lung sensitivity")
# display(rankSensitivityDF.loc[: , findSummaryMetricsCols('sensitivity') + lungCols ])

print( "\nsensitivity summary + elife sensitivity \n")
display(rankSensitivityDF.loc[: , findSummaryMetricsCols('sensitivity') + elifeCols ])

AEDWIP path : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/rank5GTEx_TCGA
AEDWIP path : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/rank10GTEx_TCGA
AEDWIP path : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/rank15GTEx_TCGA
AEDWIP path : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/rank20GTEx_TCGA

saving : /private/groups/kimlab/aedavids/deconvolution/1vsAll-~gender_category/hyperParameter/hyperparameterTunningResults2.out/rank.sensitivity.0.7.csv

sensitivity summary + elife sensitivity 



id,mean_sensitivity,std_sensitivity,numGenes,numTypes,numDegree1,numAboveThreshold,Lung,LUAD,LUSC,Colon_Sigmoid,Colon_Transverse,COAD,READ,Esophagus_Gastroesophageal_Junction,Esophagus_Mucosa,Esophagus_Muscularis,ESCA,Liver,LIHC,Stomach,STAD
rank5GTEx_TCGA,0.71259,0.214442,126,83,83,48,0.988,0.45,0.518,0.911,0.527,0.494,0.714,0.533,0.913,0.641,0.189,0.949,0.749,0.688,0.253
rank10GTEx_TCGA,0.723349,0.209278,829,83,83,49,0.988,0.505,0.611,0.92,0.539,0.557,0.714,0.582,0.928,0.689,0.144,0.765,0.753,0.693,0.187
rank15GTEx_TCGA,0.735458,0.207971,1244,83,83,50,0.988,0.553,0.664,0.911,0.543,0.563,0.714,0.604,0.937,0.725,0.117,0.816,0.798,0.698,0.196
rank20GTEx_TCGA,0.743735,0.202535,1659,83,83,50,0.986,0.592,0.688,0.911,0.547,0.563,0.696,0.613,0.94,0.728,0.126,0.89,0.807,0.712,0.209


## check specificity
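
For reference, per-category sensitivity and specificity can be derived from a multi-class confusion matrix by treating each category as "one vs. rest". The sketch below uses the standard definitions (sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)). `one_vs_rest_metrics` and the toy matrix are hypothetical, and `metricsRunner` may compute its values differently.

```python
import numpy as np


def one_vs_rest_metrics(confusion):
    """Per-class sensitivity and specificity.

    confusion[i, j] = number of samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fn = confusion.sum(axis=1) - tp      # true class i, predicted as another class
    fp = confusion.sum(axis=0) - tp      # predicted class i, truly another class
    tn = confusion.sum() - tp - fn - fp  # everything not involving class i
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# toy 3-class confusion matrix; the counts are made up for illustration
toy_confusion = [[8, 2, 0],
                 [1, 7, 2],
                 [0, 1, 9]]
toy_sensitivity, toy_specificity = one_vs_rest_metrics(toy_confusion)
print(toy_sensitivity)
print(toy_specificity)
```

With many categories, the true negative count for any single category is large, so specificity tends to sit close to 1.0. That is one reason to use a much stricter specificity threshold (e.g. 0.99) than the 0.7 used for sensitivity.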

In [7]:
outFilePrefix = "bestUnique_6_Specificity"
bestUnique_6_SpecificityDF = metricsRunner(root, outDir, outFilePrefix, bestUnique_6_Results, 
                       metric='specificity', threshold=0.7)

NameError: name 'bestUnique_6_Results' is not defined

In [None]:
bestUnique_6_SpecificityDF.iloc[:, colSampleIdx ]

In [None]:
def poorSpecificityAEDWIP(threshold):
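    '''
    returns the categories of run 'best100Enriched_6_Degree1GTEx_TCGA' whose
    specificity is < threshold

    TODO parameterize: the data frame and the run name are hard coded
    '''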
    metricCols = ['mean_specificity', 'std_specificity', 'numGenes', 'numTypes', 
                  'numDegree1', 'numAboveThreshold']
    
    specificityCols = ~bestUnique_6_SpecificityDF.columns.isin( metricCols )
    specificitySeries = bestUnique_6_SpecificityDF.loc['best100Enriched_6_Degree1GTEx_TCGA', specificityCols] 
 
    selectRows =  specificitySeries < threshold

    poorSpecificitySeries = specificitySeries.loc[selectRows] 
    
    return poorSpecificitySeries

threshold = 0.99
bestUnique_6_PoorSpecificitySeries = poorSpecificityAEDWIP(threshold=threshold)
print(f' categories with specificity < {threshold}' )
bestUnique_6_PoorSpecificitySeries

## Evaluate Elife performance on training data

In [None]:
elifeCols = ["COAD", "READ", "Colon_Sigmoid", "Colon_Transverse", "ESCA", 
             "Esophagus_Gastroesophageal_Junction", "Esophagus_Mucosa", 
             "Esophagus_Muscularis", "LIHC", "Liver", "LUSC", "LUAD", 
             "Lung", "STAD", "Stomach" ]

# sensitivity
bestUnique_6_DF.loc['best100Enriched_6_Degree1GTEx_TCGA', elifeCols]


In [None]:
bestUnique_6_SpecificityDF.loc['best100Enriched_6_Degree1GTEx_TCGA', elifeCols]

In [None]:
def evalMetrics(cols : list[str],
                sensitivityDF : pd.DataFrame,
                specificityDF : pd.DataFrame,
                runName : str,
               ) -> pd.DataFrame :
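    '''
    returns a data frame with one row for each category in cols and the
    columns "sensitivity" and "specificity" for the run named runName

    arguments:
        cols:
            list of category names
            example ["COAD", "READ", "Colon_Sigmoid", ]

        sensitivityDF, specificityDF:
            data frames returned by metricsRunner(), indexed by run id

        runName:
            example 'best100Enriched_6_Degree1GTEx_TCGA'
    '''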

    # sensitivitySeries = bestUnique_6_DF.loc['best100Enriched_6_Degree1GTEx_TCGA', cols]
    sensitivitySeries = sensitivityDF.loc[runName, cols]
    sensitivitySeries.name = "sensitivity"
    
    # specificitySeries = bestUnique_6_SpecificityDF.loc['best100Enriched_6_Degree1GTEx_TCGA', cols]
    specificitySeries = specificityDF.loc[runName, cols]
    specificitySeries.name = "specificity"
    
    byCols=1
    retDF = pd.concat( [sensitivitySeries, specificitySeries], axis=byCols).sort_values(by="id")

    return retDF

In [None]:
bestUnique_6_elifeMetricsDF = evalMetrics( elifeCols, 
                            sensitivityDF=bestUnique_6_DF,  
                            specificityDF=bestUnique_6_SpecificityDF,
                            runName='best100Enriched_6_Degree1GTEx_TCGA',
                                         )
print(f'elife types best 100 enriched by six degree 1 (i.e. at least 6 unique genes for each type)')
bestUnique_6_elifeMetricsDF

## Enrich LUAD and LUSC
<span style="color:red">We added 3, 6, 9, or 12 genes to each. This does not improve the sensitivity or specificity of LUAD and LUSC.</span>

Hypotheses:
- We did not assign "best" genes in an optimized way.
- LUAD and LUSC are very close. 1vsAll does not find fine-grained differences: both are differentially expressed in similar ways with respect to all the remaining classes.

In [None]:
enrichLUAD_LUSC_results = [
    "best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_3",
    "best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_6",
    "best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_9",
    "best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_12",
]

outFilePrefix = "enrichLUAD_LUSC_results"
LUAD_LUSC_3_sensitivityDF = metricsRunner(root, outDir, outFilePrefix, enrichLUAD_LUSC_results, 
                       metric='sensitivity', threshold=0.7)

LUAD_LUSC_3_sensitivityDF.iloc[:, colSampleIdx ]

In [None]:
outFilePrefix = "enrichLUAD_LUSC_results_Specificity"
LUAD_LUSC_3_specificityDF = metricsRunner(root, outDir, outFilePrefix, enrichLUAD_LUSC_results, 
                       metric='specificity', threshold=0.7)
LUAD_LUSC_3_specificityDF.iloc[:, colSampleIdx ]

In [None]:
# sensitivity and specificity, 3 genes added to LUAD and LUSC

runName='best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_3'
LUAD_LUSC_3_elifeMetricsDF = evalMetrics( elifeCols, 
                            sensitivityDF=LUAD_LUSC_3_sensitivityDF,  
                            specificityDF=LUAD_LUSC_3_specificityDF,
                            runName=runName,
)
print(f'runName : {runName}')
print(f'elife types best 100 enriched by six degree 1 (i.e. at least 6 unique genes for each type)')
print(f'added 3 genes to LUAD and LUSC')
LUAD_LUSC_3_elifeMetricsDF

In [None]:
# sensitivity and specificity, 12 genes added to LUAD and LUSC

# outFilePrefix = "enrichLUAD_LUSC_results_Specificity"
# LUAD_LUSC_3_specificityDF = metricsRunner(root, outDir, outFilePrefix, enrichLUAD_LUSC_results, 
#                        metric='specificity', threshold=0.7)
# LUAD_LUSC_3_specificityDF.iloc[:, colSampleIdx ]

runName='best100Enriched_6_Degree1_selectiveEnrich_LUAD_LUSC_12'
LUAD_LUSC_3_elifeMetricsDF_12 = evalMetrics( elifeCols, 
                            sensitivityDF=LUAD_LUSC_3_sensitivityDF,  
                            specificityDF=LUAD_LUSC_3_specificityDF,
                            runName=runName,
)

print(f'runName : {runName}')
print(f'elife types best 100 enriched by six degree 1 (i.e. at least 6 unique genes for each type)')
print(f'added 12 genes to LUAD and LUSC')
# display the 12 gene result computed in this cell, not the 3 gene result
LUAD_LUSC_3_elifeMetricsDF_12