# Create Cibersort Mixture Matrix
```
Andrew E. Davidson
aedavids@ucsc.edu
9/5/2022
```

Create mixture matrices that we can use to evaluate our gene signature profile matrix.

**arguments:**
1. a list of signature genes
2. a gene count matrix
    - The row ids are gene name, the column names are the sample ids
3. a DESeq ColData matrix.
    - contains sample meta data
4. DESeq estimated scaling factors
    - adjust each sample to account for library size and library composition
    
**Output:**
1. a mixture matrix file in cibersort format
    - a row for each sample in the gene count matrix.
    - there are no combinations
2. a fractions matrix file
    - The expected fractions matrix represents the linear combinations of signature profiles for a given sample.
    - contains sample meta data
    - can be used with the mixture matrix to create other test mixture matrices
    - Cibersort will fit a fractions matrix to the mixture matrix. 
    - Our fractions matrix is a ground truth label we can use to evaluate how well our gene signature matrix performs
        + for example, Cibersort may fit a fractions matrix with a very low RMSE even though the genes in the linear combination are not found in the actual sample. We have samples that are gender specific to test for these kinds of errors

3. A randomized mixture matrix file. 
    - Randomization removes all information from the mixture matrix, creating a baseline to evaluate models against. 
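
As a rough illustration of how the expected fractions matrix can serve as a ground truth label, the sketch below compares it with a fitted fractions matrix. The data frames, sample ids, and category names are made up for the example, and it assumes the fitted fractions have already been loaded into a data frame with the same sample ids and category columns.

```python
import numpy as np
import pandas as pd

# hypothetical ground truth: each sample is 100% one category
expectedDF = pd.DataFrame(
    {"Liver": [1.0, 0.0], "Lung": [0.0, 1.0]},
    index=["S1", "S2"],
)

# hypothetical fractions fitted by a deconvolution tool
fittedDF = pd.DataFrame(
    {"Liver": [0.9, 0.4], "Lung": [0.1, 0.6]},
    index=["S1", "S2"],
)

# align rows and columns before comparing
fittedDF = fittedDF.loc[expectedDF.index, expectedDF.columns]

# per sample root mean squared error between fitted and expected fractions
errorNP = fittedDF.to_numpy() - expectedDF.to_numpy()
rmsePerSample = pd.Series(np.sqrt((errorNP ** 2).mean(axis=1)),
                          index=expectedDF.index)

# fraction of samples whose largest fitted fraction is the true category
predicted = fittedDF.idxmax(axis=1)
truth = expectedDF.idxmax(axis=1)
accuracy = (predicted == truth).mean()

print(rmsePerSample)
print("accuracy:{}".format(accuracy))
```

A low per sample RMSE alone does not show that the right category was chosen, which is why the sketch also reports how often the largest fitted fraction matches the label.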

**output files:**
- testMixture.txt 3 signature genes, 83 types
    ```
    /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/test_mixture.txt

    /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/test_expectedFractions.txt
 
    /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/test_RandomizedMixture.txt 
    ```

- best GTEx_TCGA_TrainGroupby 832 signature genes, 83 types
    ```
    /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/GTEx_TCGA_TrainGroupby_mixture.txt

    /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/GTEx_TCGA_TrainGroupby_expectedFractions.txt

    /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/GTEx_TCGA_TrainGroupby_RandomizedMixture.txt
    ```

- up regulated GTEx_TCGA_TrainGroupby 1087 signature genes, 83 types
    ```
    /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/up/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/GTEx_TCGA_TrainGroupby_mixture.txt

    /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/up/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/GTEx_TCGA_TrainGroupby_expectedFractions.txt

    /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/up/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/GTEx_TCGA_TrainGroupby_RandomizedMixture.txt
    ```

In [1]:
import numpy as np
import pandas as pd
import pathlib as pl
import time    

# use display() to print an html version of a data frame
# useful if dataFrame output is not generated by the last line of the cell
from IPython.display import display

In [2]:
LOCAL_CACHE_DIR="/scratch/aedavids/tmp"

def loadCache(source, localCacheDir=LOCAL_CACHE_DIR, verbose=False):
    '''
    Reading large files over an NFS mount is slow. loadCache() will
    copy the source file into the local cache if it does not already exist
    and return the path to the local copy.
    
    arguments:
        source:
            file path
        
        localCacheDir:
            path. Default is global variable LOCAL_CACHE_DIR
            
        verbose:
            if True will print the full local cache path to the file
    '''
    # pathlib discards localCacheDir if source is an absolute path, so strip the leading "/"
    tmpSource = source
    if source[0] == "/":
        tmpSource = source[1:]
    
    localTargetPath = pl.Path(localCacheDir,  tmpSource)
    if verbose:
        print("localTargetPath:\n{}\n".format(localTargetPath))
            
    localTargetPath.parent.mkdir(parents=True, exist_ok=True)

    if not localTargetPath.exists():
        #print("localTargetPath:{} does not exist".format(localTargetPath))
        ! cp $source $localTargetPath 
        
    return localTargetPath
    
def testLoadCache():
    source = "/private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/testSignatureGenes.txt"
    loadCache( source )
    
testLoadCache()
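
`loadCache()` relies on the IPython `! cp` shell escape, so it only runs inside a notebook. A minimal sketch of the same idea in plain Python is shown below. The function name `loadCacheSketch` and the throw-away files are hypothetical, and `shutil.copy` is swapped in for `cp`.

```python
import pathlib as pl
import shutil
import tempfile

def loadCacheSketch(source, localCacheDir):
    '''
    hypothetical stand-in for loadCache(): copy source under localCacheDir,
    mirroring its full path, unless a local copy already exists
    '''
    sourcePath = pl.Path(source)
    # drop the root anchor ("/" on POSIX) so the path can be joined to the cache dir
    if sourcePath.is_absolute():
        relativeParts = sourcePath.parts[1:]
    else:
        relativeParts = sourcePath.parts

    localTargetPath = pl.Path(localCacheDir, *relativeParts)
    localTargetPath.parent.mkdir(parents=True, exist_ok=True)

    if not localTargetPath.exists():
        shutil.copy(sourcePath, localTargetPath)

    return localTargetPath

# usage with throw-away directories
with tempfile.TemporaryDirectory() as tmpDir:
    sourceFile = pl.Path(tmpDir, "nfs", "data.txt")
    sourceFile.parent.mkdir(parents=True)
    sourceFile.write_text("geneId,S1\n")

    cachedPath = loadCacheSketch(str(sourceFile), pl.Path(tmpDir, "cache"))
    cachedText = cachedPath.read_text()
    print(cachedPath)
```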

In [3]:
class CibersortMixtureMatrix(object):
    '''    
    public functions
    
    __init__(self, 
                 signatueGeneFilePath,
                 groupByGeneCountFilePath,
                 colDataFilePath,
                 scalingFactorsPath,
                 outdir="ciberSort"
                )
                
    getExpectedFractionsDF()
    
    getLabledMixtureDF()  
    
    randomizeMixture(seed=None)

    saveMixtureAndExpectedFractions()
    
    saveRandomizedMixture()
    
    
    todo fraction
    '''
    
    ################################################################################
    def __init__(self, 
                 signatueGeneFilePath,
                 groupByGeneCountFilePath,
                 colDataFilePath,
                 scalingFactorsPath,
                 outdir="ciberSort",
                ):
        '''
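        loads the input files, selects the signature genes, scales the counts,
        and creates the labeled mixture and expected fractions data frames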
        
        arguments:
            signatueGeneFilePath:
                path to a tsv file created by createCiberSortGeneSignatureMatrix.ipynb
                Each row in this file corresponds to a 1vsAll DESeq result for a specific gene
                
            groupByGeneCountFilePath:
                path to a csv file with gene counts. 
                ex. 'groupbyGeneTrainingSets/GTEx_TCGA_TrainGroupby.csv'
                
            colDataFilePath:
                path to a csv file containing sample meta data in DESeq format
            
            scalingFactorsPath:
                path to a csv file of DESeq estimated scaling factors used 
                to adjust each sample to account for library size and library composition
            
            outdir:
                string
                default = "ciberSort"
                    stored in self.outdir. Not used by the save functions, which take an explicit outDir argument
        '''
        self.signatueGeneFilePath = signatueGeneFilePath
        print("\n gene signature file:\n{}".format(self.signatueGeneFilePath))

        self.groupByGeneCountFilePath = groupByGeneCountFilePath
        print("\n groupByGeneCountFilePath:\n{}".format(self.groupByGeneCountFilePath))
        
        self.colDataFilePath = colDataFilePath
        print("\n colDataFilePath:\n{}".format(self.colDataFilePath))

        self.scalingFactorsPath = scalingFactorsPath
        print("\n scalingFactorsPath:\n{}".format(self.scalingFactorsPath))

        self.outdir = outdir
        
        self.metaColList = ['sample_id', 'participant_id', 'category', 'gender', 'age', 'dataSet']


        self.geneSignatureDF = None
        self.groupedByGeneDF = None
        self.colDataDF = None
        self.scalingFactorDF = None
        self._load()
        
        self.geneList = None
        self.mixtureDF = None
        self._select()
                
        self.labeledMixtureDF = None
        self._createLabledMixtue()
        
        self.expectedFractionsDF = None 
        self._createFractions()
    
    ################################################################################
    def getExpectedFractionsDF(self):
        return self.expectedFractionsDF
    
    ################################################################################
    def getLabledMixtureDF(self):
        return self.labeledMixtureDF
    

    ################################################################################
    def randomizeMixture(self, seed=None):
        '''
        returns a randomly shuffled copy of the mixtureDF. Randomizing the data removes all 
        information, creating a good worst case baseline you can use to evaluate
        models against. The count values in each row are shuffled independently; 
        the meta data columns are not shuffled

        arguments:
            seed:
                integer
                set if you want the pseudo random generator to return the same sequence of
                results. Use for testing purposes only


        ref: 
            - https://github.com/aedavids/lab3RotationProject/blob/master/src/test/testDataFactory.py
            - https://github.com/aedavids/lab3RotationProject/blob/master/src/DEMETER2/dataFactory.py
        '''
        if seed is not None:
            np.random.seed(seed)
        else:
            epochTime = int(time.time())
            np.random.seed(epochTime)


        # make a copy to ensure no side effect
        # remove the gene names and meta data. we only want to 
        # shuffle the count values
        copyDF = self.labeledMixtureDF.copy()   
        
        cols = copyDF.columns.to_list()
    
        removedMetaCols = []
        for colName in self.metaColList:
            if colName in cols:
                cols.remove(colName)
                removedMetaCols.append(colName)

    
        copyDF = copyDF.loc[:, cols]        
        removedMetaDF =  self.labeledMixtureDF.loc[:, removedMetaCols] 

        numRows, numCols = copyDF.shape   
        valuesNP = copyDF.values
        for r in range(numRows):
            randomRowIdx = np.random.permutation(numCols)
            #print("randomRowIdx:{}".format(randomRowIdx))
            # use numpy fancy indexing
            valuesNP[r,:] = valuesNP[r, randomRowIdx]

        # create data frame from the scrambled valuesNP
        # reuse the original index so the column bind below aligns row for row
        randomDF = pd.DataFrame(valuesNP, columns=copyDF.columns, index=copyDF.index)

        # add the meta data back
        byColumns=1 # column bind
        retDF = pd.concat([randomDF, removedMetaDF], axis=byColumns)
    
        return retDF
    
    
    ################################################################################
    def saveMixtureAndExpectedFractions(self, outDir, prefixStr=None):
        '''
        saves the mixture matrix in cibersort's expected format, and the expected
        fractions matrix as a tab separated file
        '''
#         (base) $ cat mixture.txt 
#         sampleTitle	S1	S2	S3	S4	S5	S6
#         G1	1.0	0.0	0.0	1.0	1.0	0.0
#         G2	1.0	0.0	0.0	1.0	1.0	0.0
        ciberSortFmtDF = self._convertToMixtureToCiberSortFmt(self.labeledMixtureDF)
        self._save( outDir, ciberSortFmtDF, fileName="mixture.txt", prefixStr=prefixStr)
        
        self._save( outDir, self.expectedFractionsDF, fileName="expectedFractions.txt", prefixStr=prefixStr)
    
    
    ################################################################################
    def _convertToMixtureToCiberSortFmt(self, df):
        '''
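        returns a copy of df in cibersort's mixture format: the meta data columns
        are dropped, and the frame is transposed so that each row is a gene and each
        column is a sample. The first column, 'sampleTitle', contains the gene names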
        '''
        metaList = self.metaColList.copy()
        metaList.remove('sample_id')
        dataCols = df.columns.to_list()
        for m in metaList:
            dataCols.remove(m)

        #print(dataCols)
        mixtureDF = df.loc[:, dataCols]
        #display(mixtureDF.shape)
        #mixtureDF.head()

        #print("\n rename and set index")
        mixtureDF = mixtureDF.rename(columns={"sample_id":"sampleTitle"})
        mixtureDF = mixtureDF.set_index('sampleTitle')
        #print(mixtureDF.shape)
        #display( mixtureDF.head() )

        #print("\n transpose")
        mixtureDF = mixtureDF.transpose()
        #display( mixtureDF.head() )
        
        #print("\n reset index")
        mixtureDF = mixtureDF.reset_index()
        # the original index data is now a column with name 'index'
        mixtureDF = mixtureDF.rename(columns={"index":"sampleTitle"})
        #display( mixtureDF.head() ) 
        
        return mixtureDF

    
    ################################################################################
    def saveRandomizedMixture(self, outDir, randomizedDF, prefixStr=None):
        '''
        saves the randomized mixture in cibersort's expected format
        '''
        ciberSortFmtDF = self._convertToMixtureToCiberSortFmt(randomizedDF)        
        self._save( outDir, ciberSortFmtDF, fileName="RandomizedMixture.txt", prefixStr=prefixStr)    
    
    ################################################################################
    def _save(self, outDir, df, fileName, prefixStr=None):
        '''
        writes df to a tab separated file, without the index
        
        arguments:
            fileName
                string. example 'mixture.txt'
        '''
        dataOutdir = pl.Path(outDir)

        if prefixStr:
            path = dataOutdir.joinpath(prefixStr + "_" + fileName)
        else:
            path = dataOutdir.joinpath(fileName)   

        path.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(path, index=False, sep="\t")
        print("\n saved to: {}".format(path))          
        
    ################################################################################
    def _load(self):
        self.geneSignatureDF = pd.read_csv(self.signatueGeneFilePath, sep="\t" )
        print("\ngeneSignatureDF.shape:{}".format(self.geneSignatureDF.shape))
        print("geneSignatureDF.iloc[0:3, :]")
        print(self.geneSignatureDF.iloc[0:3, :])
        
        self.groupedByGeneDF = pd.read_csv(self.groupByGeneCountFilePath, sep=",")
        print("\ngroupByGeneDF.shape:{}".format(self.groupedByGeneDF.shape))
        print("groupByGeneDF.iloc[0:3, 0:3]")
        print(self.groupedByGeneDF.iloc[0:3, 0:3])
        
        self.colDataDF = pd.read_csv(self.colDataFilePath, sep=",")
        print("colDataDF.iloc[0:3, :]")
        print(self.colDataDF.iloc[0:3, :])
        
        self.scalingFactorDF = pd.read_csv(self.scalingFactorsPath, sep=",")
        print("scalingFactorDF.iloc[0:3, :]")
        print(self.scalingFactorDF.iloc[0:3, :])
        
    ################################################################################
    def _select(self):
        self.geneList = self.geneSignatureDF.loc[:, "name"].to_list()
        #oneVsAllDF = oneVsAllDF[ oneVsAllDF.loc[:,"name"].isin( self.geneListsorted) ]
        #oneVsAllDF = oneVsAllDF.sort_values( by=["name"] )
        
        gbDF = self.groupedByGeneDF
        df = gbDF[ gbDF.loc[:,"geneId"].isin(self.geneList) ]
        # sort makes debug easier
        df = df.sort_values( by=["geneId"] )

        # do not change sort order. It should already match colData order
        
        # rename to match cibersort expected format
        df = df.rename(columns={'geneId':"name"})
        
        # set index will make join after transpose easier
        # if we do not set the index, the 'name' column will become a row
        # instead of the column names
        df = df.set_index('name')
        
        self.mixtureDF = df
        print("\n mixtureDF.iloc[0:3, 0:3]")
        print(self.mixtureDF.iloc[0:3, 0:3])
        

    ################################################################################
    def _createLabledMixtue(self):
        '''
        transpose,
        scale,
        join mixtureDF with colData
        '''
        transposeGroupByDF = self.mixtureDF.transpose(copy=True)
        print("\n transposeGroupByDF.iloc[0:3, 0:3]")
        print(transposeGroupByDF.iloc[0:3, 0:3])

        # normalize counts
        # element wise multiplication. use .values so the multiplication is by
        # position (numpy broadcasting), not by index label
        transposeGroupByDF = transposeGroupByDF * self.scalingFactorDF.values
    
        print("\n scaled transposeGroupByDF.iloc[0:3, 0:3]")
        print(transposeGroupByDF.iloc[0:3, 0:3])

        self.labeledMixtureDF =  pd.merge(left=transposeGroupByDF, 
                right=self.colDataDF, 
                how='inner', 
                left_index=True, #left_on="name", 
                right_on="sample_id")

        print("\n labeledMixtureDF.iloc[0:3, 0:3]")
        print(self.labeledMixtureDF.iloc[0:3, 0:3])  
        
        #metaColList = ['sample_id', 'participant_id', 'category', 'gender', 'age', 'dataSet']
        print("\n labeledMixtureDF.loc[0:3, {}]".format(self.metaColList))
        print(self.labeledMixtureDF.loc[0:3, self.metaColList])        
        
    ################################################################################    
    def _createFractions(self):
        '''
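        creates expectedFractionsDF. Each row is a sample. There is one column per 
        category, set to 1.0 for the sample's category and 0.0 everywhere else. 
        The sample meta data columns are prepended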
        '''
        
        df = self.labeledMixtureDF.loc[:,["category"]].drop_duplicates()\
                    .sort_values(by='category')
        listOfTypes = df["category"].values.tolist()
        print("\n number of types: {}".format(len(listOfTypes)))
        
        # create an empty data frame, and set columns
        self.expectedFractionsDF = pd.DataFrame(columns=['sample_id'] + listOfTypes)
        numTypes = len(listOfTypes)

        for index, row in self.labeledMixtureDF.iterrows():
            sample_id = row['sample_id']
            category = row['category']
            #print("sample_id:{} category:{}".format(sample_id, category))
            idx = listOfTypes.index( category)
            linearCombination = np.zeros(numTypes)
            linearCombination[idx] = 1.0
            # use comprehension to expand numpy array into format that works with pandas
            # this method of append seems slow, it is okay for GTEX_TCGA
            # for large data sets consider
            # https://stackoverflow.com/a/48287388
            self.expectedFractionsDF.loc[ len(self.expectedFractionsDF.index) ] = [sample_id]\
                            + [i for i in linearCombination]

        # prepend the meta data to the expected fractions data frame
        # would have to write a lot more buggy code if we did this in the for loop
        #metaColList = ['sample_id', 'participant_id', 'category', 'gender', 'age', 'dataSet']
        metaDF = self.labeledMixtureDF.loc[:, self.metaColList]
        self.expectedFractionsDF = pd.merge(left=metaDF, 
                                            right=self.expectedFractionsDF, 
                                            how='inner', 
                                            left_on="sample_id",
                                            right_on="sample_id")

        print("\n expectedFractionsDF.iloc[0:3, 0:10]")
        print(self.expectedFractionsDF.iloc[0:3, 0:10])  

## Test the CibersortMixtureMatrix Class

In [4]:
%%time
def testCibersortMixtureMatrix():
    # common file paths
    rootDir = "/private/groups/kimlab/GTEx_TCGA"
    groupByDataDir = rootDir + "/groupbyGeneTrainingSets"
    geneSignatureProfilesDir = rootDir + "/geneSignatureProfiles"
    
    # path to gene signature file
    hypothesis = "best"
    dataSet = "GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25"
    #GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25
    signatureFile = "ciberSort/testSignatureGenes.txt"
    testGeneSignatureFile = geneSignatureProfilesDir \
                            + "/" + hypothesis \
                            + "/" + dataSet \
                            + "/" + signatureFile
    
    
    # reading over NFS is slow, cache local
    localGeneSigFile = loadCache(testGeneSignatureFile)
    
    # path to gene count file
    trainGroupByGeneCountFilePath = groupByDataDir + "/GTEx_TCGA_TrainGroupby.csv"    
    groupByGeneCountFilePath = loadCache(trainGroupByGeneCountFilePath)

    trainingColDataFilePath = groupByDataDir + "/GTEx_TCGA_TrainColData.csv"
    colDataFilePath = loadCache(trainingColDataFilePath)

    #/private/groups/kimlab/GTEx_TCGA/1vsAll/estimatedSizeFactors.csv
    oneVsAllDataDir = rootDir + "/1vsAll"    
    estimatedScalingFactorsFilePath = oneVsAllDataDir + "/estimatedSizeFactors.csv"
    scalingFactorsPath = loadCache(estimatedScalingFactorsFilePath, verbose=True)
    
    tcmm = CibersortMixtureMatrix(
        localGeneSigFile,
        groupByGeneCountFilePath,
        colDataFilePath,
        scalingFactorsPath
    )
    
    return tcmm
    
tcmm = testCibersortMixtureMatrix()

localTargetPath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/1vsAll/estimatedSizeFactors.csv


 gene signature file:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/testSignatureGenes.txt

 groupByGeneCountFilePath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/groupbyGeneTrainingSets/GTEx_TCGA_TrainGroupby.csv

 colDataFilePath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/groupbyGeneTrainingSets/GTEx_TCGA_TrainColData.csv

 scalingFactorsPath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/1vsAll/estimatedSizeFactors.csv

geneSignatureDF.shape:(3, 4)
geneSignatureDF.iloc[0:3, :]
         name       ACC  Adipose_Subcutaneous  Adipose_Visceral_Omentum
0  AC010329.1 -9.430255             -5.666250                 -8.521738
1  AC013391.3 -4.238785             -8.850412                 -8.152083
2  AC092720.2  0.216492             -8.516456     

In [5]:
def testExpectedFractions(tcmm):
    '''
    sanity test: check shapes
    '''
    #display(tcmm.expectedFractionsDF)
    dataColumns = tcmm.expectedFractionsDF.columns.to_list()
    metaColList = tcmm.metaColList.copy()
    for m in metaColList:
        dataColumns.remove(m)
    print("tcmm.expectedFractionsDF.shape:{}".format(tcmm.expectedFractionsDF.shape))
    
    print("tcmm.groupedByGeneDF.shape:{}".format(tcmm.groupedByGeneDF.shape))

    
    # groupedByGeneDF has a geneId column
    expectedNumSamples = tcmm.groupedByGeneDF.shape[1] - 1
    numSamples = tcmm.expectedFractionsDF.shape[0]
    msg = "expected num samples:{} actual num samples:{}".format(expectedNumSamples, numSamples)
    assert expectedNumSamples == numSamples, msg
    
    expectedSum = expectedNumSamples
    actualSum = tcmm.expectedFractionsDF.loc[:, dataColumns].to_numpy().sum()
    msg = "expected sum:{} actual sum:{}".format(expectedSum, actualSum)
    assert expectedSum == actualSum, msg

testExpectedFractions(tcmm)

tcmm.expectedFractionsDF.shape:(15801, 89)
tcmm.groupedByGeneDF.shape:(74777, 15802)
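
The test above checks that the expected fractions sum to the number of samples, which holds because each sample is assigned 1.0 for exactly one category. The comment in `_createFractions()` notes that the row by row append is slow. One possible alternative, not the method used in this notebook, is `pd.get_dummies()`. The sample ids and categories below are made up for the sketch.

```python
import pandas as pd

# hypothetical sample meta data
labelsDF = pd.DataFrame({
    "sample_id": ["S1", "S2", "S3"],
    "category": ["Lung", "Liver", "Lung"],
})

# one column per category; 1.0 marks the sample's category, 0.0 everywhere else
oneHotDF = pd.get_dummies(labelsDF["category"], dtype=float)

# column bind the meta data in front of the one hot columns
byColumns = 1
expectedFractionsDF = pd.concat([labelsDF, oneHotDF], axis=byColumns)
print(expectedFractionsDF)

# same sanity check as testExpectedFractions(): every sample contributes exactly 1.0
assert oneHotDF.to_numpy().sum() == labelsDF.shape[0]
```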


In [6]:
%%time
def testCibersortMixtureMatrixAssert(tcmm):
    expectedDF_before_scaling = pd.DataFrame(
        {'geneId': {17386: 'AC010329.1', 18356: 'AC013391.3', 23516: 'AC092720.2'},
     'GTEX-1117F-0226-SM-5GZZ7': {17386: 69, 18356: 0, 23516: 0},
     'GTEX-1117F-0526-SM-5EGHJ': {17386: 0, 18356: 0, 23516: 0},
     'GTEX-1117F-0726-SM-5GIEN': {17386: 1, 18356: 0, 23516: 0}}
    )
    
    #expectedDict = tcmm.getLabledMixtureDF().iloc[0:4,0:4].to_dict()
    expectedDict = {
        'AC010329.1': {0: 57.00542333223874, 1: 0.0, 2: 0.532358863795858, 3: 160.86533930625043}, 
        'AC013391.3': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}, 
        'AC092720.2': {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0}, 
        'sample_id': {0: 'GTEX-1117F-0226-SM-5GZZ7', 1: 'GTEX-1117F-0526-SM-5EGHJ', 2: 'GTEX-1117F-0726-SM-5GIEN', 3: 'GTEX-1117F-2826-SM-5GZXL'}}

 
    expectedDF = pd.DataFrame( expectedDict )
    #expectedDF.index.name = "name"

    testDF = tcmm.getLabledMixtureDF().iloc[0:4,0:4]

    print("expectedDF")
    display(expectedDF)
    print("\n testDF")
    display(testDF)

    pd.testing.assert_frame_equal(expectedDF, testDF)

testCibersortMixtureMatrixAssert(tcmm)

expectedDF


   AC010329.1  AC013391.3  AC092720.2                 sample_id
0   57.005423         0.0         0.0  GTEX-1117F-0226-SM-5GZZ7
1         0.0         0.0         0.0  GTEX-1117F-0526-SM-5EGHJ
2    0.532359         0.0         0.0  GTEX-1117F-0726-SM-5GIEN
3  160.865339         0.0         0.0  GTEX-1117F-2826-SM-5GZXL



 testDF


   AC010329.1  AC013391.3  AC092720.2                 sample_id
0   57.005423         0.0         0.0  GTEX-1117F-0226-SM-5GZZ7
1         0.0         0.0         0.0  GTEX-1117F-0526-SM-5EGHJ
2    0.532359         0.0         0.0  GTEX-1117F-0726-SM-5GIEN
3  160.865339         0.0         0.0  GTEX-1117F-2826-SM-5GZXL


CPU times: user 24.6 ms, sys: 3.06 ms, total: 27.7 ms
Wall time: 26.7 ms
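
The values asserted above are counts after `_createLabledMixtue()` multiplies the transposed count matrix by `scalingFactorDF.values`. The sketch below shows how that element wise multiplication behaves, assuming the scaling factors are stored as a single column with one row per sample. The counts, factors, and the column name `sizeFactor` are made up. Whether stored factors should be multiplied or divided depends on how they were exported; the sketch follows the notebook and multiplies.

```python
import pandas as pd

# hypothetical counts: rows are samples, columns are genes
countsDF = pd.DataFrame(
    [[10, 0], [4, 2]],
    index=["S1", "S2"],
    columns=["G1", "G2"],
)

# hypothetical per sample scaling factors, one row per sample
scalingFactorDF = pd.DataFrame({"sizeFactor": [0.5, 2.0]})

# .values has shape (numSamples, 1). It is broadcast across the gene columns,
# so every count in a row is multiplied by that sample's factor
scaledDF = countsDF * scalingFactorDF.values
print(scaledDF)
```

Because `.values` drops the index, the multiplication is by position. The rows of the scaling factor file must be in the same order as the sample columns of the count matrix.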


In [7]:
def testSaveMixture(tcmm):
    outdirStr = str(tcmm.signatueGeneFilePath.parent)
    prefixLen = len(LOCAL_CACHE_DIR)
    outDir = outdirStr[prefixLen:]
    tcmm.saveMixtureAndExpectedFractions(outDir=outDir, prefixStr="test")
              
testSaveMixture(tcmm)


 saved to: /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/test_mixture.txt

 saved to: /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/test_expectedFractions.txt
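
The saved mixture file has one row per gene and one column per sample, which is the transpose of the labeled mixture data frame. A minimal sketch of that conversion on made-up data follows. It condenses the steps in `_convertToMixtureToCiberSortFmt()` and is not a drop-in replacement.

```python
import pandas as pd

# hypothetical labeled mixture: one row per sample, gene columns plus meta data
labeledDF = pd.DataFrame({
    "G1": [1.0, 0.0],
    "G2": [3.0, 2.0],
    "sample_id": ["S1", "S2"],
    "category": ["Lung", "Liver"],
})

# keep only the gene columns, index by sample, then transpose so rows are genes
geneCols = ["G1", "G2"]
ciberSortFmtDF = labeledDF.set_index("sample_id")[geneCols].transpose()

# the gene names become the first column, named 'sampleTitle'
ciberSortFmtDF.index.name = "sampleTitle"
ciberSortFmtDF.columns.name = None
ciberSortFmtDF = ciberSortFmtDF.reset_index()

# tab separated text, as written by _save()
tsvText = ciberSortFmtDF.to_csv(sep="\t", index=False)
print(tsvText)
```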


In [8]:
def testRandomizeMixture(tcmm):
    meaningOfLife = 42
    randomizedDF = tcmm.randomizeMixture(seed=meaningOfLife)
    print("randomizedDF.shape:{}".format(randomizedDF.shape))
    display(randomizedDF.iloc[0:4, 0:4])
    
    # https://www.statology.org/pandas-exclude-column/
    randomizedDF = randomizedDF.loc[:, ~randomizedDF.columns.isin(tcmm.metaColList)]
    print("randomizedDF.shape:{}".format(randomizedDF.shape))
    
    byRow=1
    randomSumList = randomizedDF.sum(axis=byRow).to_list()
    print("randomSumList.sum[0:5]:\n{}".format(randomSumList[0:5]))
    
    mixtureDF = tcmm.getLabledMixtureDF()
    mixtureDF = mixtureDF.loc[:, ~mixtureDF.columns.isin(tcmm.metaColList)]
    print("\nmixtureDF.shape:{}".format(mixtureDF.shape))
    print("mixtureDF.iloc[:, 0:4]")
    display(mixtureDF.iloc[0:4, 0:4])    
    
    mixtureSumList = mixtureDF.sum(axis=byRow).to_list()
    print("mixtureSumList.sum[0:5]:\n{}".format(mixtureSumList[0:5]))
        
    np.testing.assert_allclose(randomSumList, mixtureSumList)
    
    
testRandomizeMixture(tcmm)  

randomizedDF.shape:(15801, 9)


   AC010329.1  AC013391.3  AC092720.2                 sample_id
0   57.005423         0.0         0.0  GTEX-1117F-0226-SM-5GZZ7
1         0.0         0.0         0.0  GTEX-1117F-0526-SM-5EGHJ
2    0.532359         0.0         0.0  GTEX-1117F-0726-SM-5GIEN
3         0.0         0.0  160.865339  GTEX-1117F-2826-SM-5GZXL


randomizedDF.shape:(15801, 3)
randomSumList.sum[0:5]:
[57.00542333223874, 0.0, 0.532358863795858, 160.86533930625043, 204.91007979923864]

mixtureDF.shape:(15801, 3)
mixtureDF.iloc[:, 0:4]


   AC010329.1  AC013391.3  AC092720.2
0   57.005423         0.0         0.0
1         0.0         0.0         0.0
2    0.532359         0.0         0.0
3  160.865339         0.0         0.0


mixtureSumList.sum[0:5]:
[57.00542333223874, 0.0, 0.532358863795858, 160.86533930625043, 204.91007979923864]
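
The row sums match because `randomizeMixture()` only reorders values within each sample's row. The sketch below shows the same row wise shuffle with NumPy's `Generator.permuted()`, which is swapped in here for the explicit loop over `np.random.permutation()` and requires NumPy 1.20 or later. The values are made up.

```python
import numpy as np

# hypothetical scaled counts: rows are samples, columns are genes
valuesNP = np.array([
    [57.0, 0.0, 3.0],
    [1.0, 2.0, 4.0],
])

# seeded only to make the sketch repeatable
rng = np.random.default_rng(42)

# with axis=1 each row is shuffled independently; the input is not modified
shuffledNP = rng.permuted(valuesNP, axis=1)
print(shuffledNP)

# shuffling within a row keeps the row sums unchanged
np.testing.assert_allclose(shuffledNP.sum(axis=1), valuesNP.sum(axis=1))
```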


In [11]:
def testSaveRandomMixture(tcmm):
    outdirStr = str(tcmm.signatueGeneFilePath.parent)
    prefixLen = len(LOCAL_CACHE_DIR)
    outDir = outdirStr[prefixLen:]
    randomizedDF = tcmm.randomizeMixture()
    tcmm.saveRandomizedMixture(outDir, randomizedDF, prefixStr="test")

testSaveRandomMixture(tcmm)


 saved to: /private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/test_RandomizedMixture.txt


## Create Best mixture matrix

In [21]:
%%time
def createBestMixtureMatrix():
    # common file paths
    rootDir = "/private/groups/kimlab/GTEx_TCGA"
    groupByDataDir = rootDir + "/groupbyGeneTrainingSets"
    geneSignatureProfilesDir = rootDir + "/geneSignatureProfiles"

    # path to gene signature file
    hypothesis = "best"
    dataSet = "GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25"
    #GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25
    signatureFile = "ciberSort/signatureGenes.txt"
    bestGeneSignatureFile = geneSignatureProfilesDir \
                            + "/" + hypothesis \
                            + "/" + dataSet \
                            + "/" + signatureFile

    p = pl.Path(bestGeneSignatureFile)
    if not p.exists():
        print("ERROR: file not found\n{}".format(p))
    
    # reading over NFS is slow, cache local
    localGeneSigFile = loadCache(bestGeneSignatureFile)

    # path to gene count file
    trainGroupByGeneCountFilePath = groupByDataDir + "/GTEx_TCGA_TrainGroupby.csv"    
    groupByGeneCountFilePath = loadCache(trainGroupByGeneCountFilePath)

    trainingColDataFilePath = groupByDataDir + "/GTEx_TCGA_TrainColData.csv"
    colDataFilePath = loadCache(trainingColDataFilePath)

    #/private/groups/kimlab/GTEx_TCGA/1vsAll/estimatedSizeFactors.csv
    oneVsAllDataDir = rootDir + "/1vsAll"    
    estimatedScalingFactorsFilePath = oneVsAllDataDir + "/estimatedSizeFactors.csv"
    scalingFactorsPath = loadCache(estimatedScalingFactorsFilePath, verbose=True)
     
    bestCMM = CibersortMixtureMatrix(
        localGeneSigFile,
        groupByGeneCountFilePath,
        colDataFilePath,
        scalingFactorsPath
    )
    
    # save
    outdirStr = str(bestCMM.signatueGeneFilePath.parent)
    prefixLen = len(LOCAL_CACHE_DIR)
    outDir = outdirStr[prefixLen:]
    
    bestCMM.saveMixtureAndExpectedFractions(outDir, prefixStr="GTEx_TCGA_TrainGroupby")
    
    randomizedDF = bestCMM.randomizeMixture()
    bestCMM.saveRandomizedMixture(outDir, randomizedDF, prefixStr="GTEx_TCGA_TrainGroupby")

    return bestCMM

bestCMM = createBestMixtureMatrix()

localTargetPath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/1vsAll/estimatedSizeFactors.csv


 gene signature file:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/best/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/signatureGenes.txt

 groupByGeneCountFilePath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/groupbyGeneTrainingSets/GTEx_TCGA_TrainGroupby.csv

 colDataFilePath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/groupbyGeneTrainingSets/GTEx_TCGA_TrainColData.csv

 scalingFactorsPath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/1vsAll/estimatedSizeFactors.csv

geneSignatureDF.shape:(832, 84)
geneSignatureDF.iloc[0:3, :]
         name        ACC  Adipose_Subcutaneous  Adipose_Visceral_Omentum  \
0   (AGTGCC)n  -3.647485             -0.256342                 -0.178603   
1  (CAGAGGC)n -30.000000             -2.067227                 -1.619289   
2     (CCAG)n  -2.334555             -0

## Create Up regulated mixture matrix

In [22]:
%%time
def createUpMixtureMatrix():
    # common file paths
    rootDir = "/private/groups/kimlab/GTEx_TCGA"
    groupByDataDir = rootDir + "/groupbyGeneTrainingSets"
    geneSignatureProfilesDir = rootDir + "/geneSignatureProfiles"

    # path to gene signature file
    hypothesis = "up"
    dataSet = "GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25"
    #GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25
    signatureFile = "ciberSort/signatureGenes.txt"
    upGeneSignatureFile = geneSignatureProfilesDir \
                            + "/" + hypothesis \
                            + "/" + dataSet \
                            + "/" + signatureFile

    p = pl.Path(upGeneSignatureFile)
    if not p.exists():
        print("ERROR: file not found\n{}".format(p))
    
    # reading over NFS is slow, cache local
    localGeneSigFile = loadCache(upGeneSignatureFile)

    # path to gene count file
    trainGroupByGeneCountFilePath = groupByDataDir + "/GTEx_TCGA_TrainGroupby.csv"    
    groupByGeneCountFilePath = loadCache(trainGroupByGeneCountFilePath)
    
    trainingColDataFilePath = groupByDataDir + "/GTEx_TCGA_TrainColData.csv"
    colDataFilePath = loadCache(trainingColDataFilePath)
    
    #/private/groups/kimlab/GTEx_TCGA/1vsAll/estimatedSizeFactors.csv
    oneVsAllDataDir = rootDir + "/1vsAll"    
    estimatedScalingFactorsFilePath = oneVsAllDataDir + "/estimatedSizeFactors.csv"
    scalingFactorsPath = loadCache(estimatedScalingFactorsFilePath, verbose=True)
 
    upCMM = CibersortMixtureMatrix(
        localGeneSigFile,
        groupByGeneCountFilePath,
        colDataFilePath,
        scalingFactorsPath
    )
    
    # save
    outdirStr = str(upCMM.signatueGeneFilePath.parent)
    prefixLen = len(LOCAL_CACHE_DIR)
    outDir = outdirStr[prefixLen:]
    
    upCMM.saveMixtureAndExpectedFractions(outDir, prefixStr="GTEx_TCGA_TrainGroupby")
    
    randomizedDF = upCMM.randomizeMixture()
    upCMM.saveRandomizedMixture(outDir, randomizedDF, prefixStr="GTEx_TCGA_TrainGroupby")

    return upCMM

upCMM = createUpMixtureMatrix()

localTargetPath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/1vsAll/estimatedSizeFactors.csv


 gene signature file:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/geneSignatureProfiles/up/GTEx_TCGA_1vsAll-design:~__gender_+_category-padj:0.001-lfc:2.0-n:25/ciberSort/signatureGenes.txt

 groupByGeneCountFilePath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/groupbyGeneTrainingSets/GTEx_TCGA_TrainGroupby.csv

 colDataFilePath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/groupbyGeneTrainingSets/GTEx_TCGA_TrainColData.csv

 scalingFactorsPath:
/scratch/aedavids/tmp/private/groups/kimlab/GTEx_TCGA/1vsAll/estimatedSizeFactors.csv

geneSignatureDF.shape:(1087, 84)
geneSignatureDF.iloc[0:3, :]
    name       ACC  Adipose_Subcutaneous  Adipose_Visceral_Omentum  \
0  (TG)n  2.232453              0.215499                  0.528656   
1    A2M  0.110606              0.547254                  0.735236   
2  A2ML1 -3.980913             -6.712909                 -