# Verify CIBERSORTxFractionsWorkflow
```
Andrew Davidson
aedavids@ucsc.edu
3/28/31
```
Verify that scatter/gather produces the same results as a run with the entire mixture file.

**Abstract**

A mixture matrix with 15,801 samples was run three different ways:
- using Stanford's docker on the entire mixture file
- using scatter/gather, with at most 500 samples in a partition
- using scatter/gather, with at most 1000 samples in a partition

The fractions, R (Correlation), and RMSE are identical.

The p-values vary. This is probably because the p-value is calculated using a Monte Carlo simulation and we cannot set the random seed.

In [1]:
import numpy as np
import pandas as pd
import pathlib as pl

In [2]:
# ran Stanford docker on 2022-10-18-07 took 83:02 hrs. 
dockerRootPL = pl.Path("/private/groups/kimlab/GTEx_TCGA/cibersort.out/GTEx_TCGA_TrainGroupby_mixture")
dockerResultsPL =  dockerRootPL.joinpath("CIBERSORTx_GTEx_TCGA_TrainGroupby_mixture_Results.txt")

**/scratch/aedavids/CIBERSORTxFractionsWorkflow**
extraCellularRNA/terra/cibersortx/wdl/runCibersortxFractionsTask.sh

Two runs: one with numSamplesInPartition = 500, the other with numSamplesInPartition = 1000.

```
$ cat CIBERSORTxFractionsWorkflow.wdl.input.json
{
  "CIBERSORTxFractionsWorkflow.sigmatrix": "/scratch/aedavids/CIBERSORTxFractionsWorkflow/wdl/geneSignatureProfiles/best/tmp/signatureGenes.tsv",
  "CIBERSORTxFractionsWorkflow.QN": "false",
  "CIBERSORTxFractionsWorkflow.verbose": "true",
  "CIBERSORTxFractionsWorkflow.token": "3f561ab6d4cf373d11f23d8e205b4b72",
  "CIBERSORTxFractionsWorkflow.username":  "aedavids@ucsc.edu",
  "CIBERSORTxFractionsWorkflow.perm": "100",
  "CIBERSORTxFractionsWorkflow.label": "fraction",
  "CIBERSORTxFractionsWorkflow.mixture":  "/scratch/aedavids/CIBERSORTxFractionsWorkflow/wdl/geneSignatureProfiles/best/tmp/GTEx_TCGA_TrainGroupby_mixture.txt",
  "CIBERSORTxFractionsWorkflow.numSamplesInPartition": "500",
  "CIBERSORTxFractionsWorkflow.isCSV": "false"
}
```
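The effect of `numSamplesInPartition` on the scatter step can be sketched as below. This is only an illustration under assumptions: `partition_samples` and the `sample_i` names are hypothetical, and the real workflow's splitting logic lives in the WDL, which is not shown here. The sketch assumes consecutive chunks of at most N samples, with the gather step concatenating results in the original order.

```python
import math

def partition_samples(sample_names, num_samples_in_partition):
    """Split sample names into consecutive chunks of at most num_samples_in_partition.

    Hypothetical helper: illustrates the scatter step, not the workflow's actual code.
    """
    if num_samples_in_partition < 1:
        raise ValueError("num_samples_in_partition must be >= 1")
    return [
        sample_names[start:start + num_samples_in_partition]
        for start in range(0, len(sample_names), num_samples_in_partition)
    ]

# placeholder names standing in for the 15,801 samples in the mixture matrix
samples = [f"sample_{i}" for i in range(15801)]

for size in (500, 1000):
    parts = partition_samples(samples, size)
    # the number of partitions is the ceiling of (samples / partition size)
    assert len(parts) == math.ceil(len(samples) / size)
    # gather: concatenating the partitions restores the original sample order
    gathered = [name for part in parts for name in part]
    assert gathered == samples
    print(f"size={size} partitions={len(parts)} last partition={len(parts[-1])}")
```

With 15,801 samples this gives 32 partitions for a size of 500 (the last holding 301 samples) and 16 partitions for a size of 1000 (the last holding 801).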

In [3]:
wfRootPl = pl.Path("/scratch/aedavids/CIBERSORTxFractionsWorkflow/wdl/")
parts500Pl = wfRootPl.joinpath("numSamplesInPartition500.output/results.txt")
parts1000Pl = wfRootPl.joinpath("numSamplesInPartition1000.output/results.txt")

# check headers

In [4]:
dockerDF = pd.read_csv(dockerResultsPL, sep="\t")
print(f'dockerDF.shape : {dockerDF.shape}')
dockerDF_columns = dockerDF.columns

dockerDF.shape : (15801, 87)


In [5]:
parts500DF = pd.read_csv(parts500Pl, sep="\t")
print(f'parts500DF.shape : {parts500DF.shape}')
parts500DF_columns = parts500DF.columns

parts500DF.shape : (15801, 87)


In [6]:
parts1000DF = pd.read_csv(parts1000Pl, sep="\t")
print(f'parts1000DF.shape : {parts1000DF.shape}')
partsDF1000_columns = parts1000DF.columns

parts1000DF.shape : (15801, 87)


In [7]:
assert (dockerDF_columns == parts500DF_columns).all(), "ERROR docker and parts500 columns are different"

In [8]:
assert (dockerDF_columns == partsDF1000_columns).all(), "ERROR docker and parts1000 columns are different"
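If a header check like the ones above ever fails, it helps to see which columns differ. Note that an element-wise `==` between two indexes of different lengths may raise an error before the assert message is shown. The following is a minimal sketch of a more descriptive comparison; `describe_column_mismatch` and the example column names are hypothetical and not part of the workflow.

```python
import pandas as pd

def describe_column_mismatch(left, right):
    """Return a dict describing how two column indexes differ (empty dict if identical).

    Hypothetical helper for debugging a failed header comparison.
    """
    left, right = pd.Index(left), pd.Index(right)
    if left.equals(right):
        return {}
    return {
        "only_in_left": left.difference(right).tolist(),
        "only_in_right": right.difference(left).tolist(),
        # True when both hold the same names but in a different order
        "same_names_different_order": sorted(left) == sorted(right),
    }

# made-up column names, for illustration only
a = pd.DataFrame(columns=["Mixture", "T_cell", "P-value"])
b = pd.DataFrame(columns=["Mixture", "B_cell", "P-value", "RMSE"])
print(describe_column_mismatch(a.columns, b.columns))
```

`Index.equals` handles indexes of different lengths without raising, so the helper can always report something readable.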

# check sample names

In [9]:
dockerSampleNamesNP    = dockerDF.loc[:   , 'Mixture'].values
parts500SampleNamesNP  = parts500DF.loc[: , 'Mixture'].values
parts1000SampleNamesNP = parts1000DF.loc[:, 'Mixture'].values

In [10]:
emsg = "ERROR docker and 500 sample names are different"
np.testing.assert_array_equal(dockerSampleNamesNP, parts500SampleNamesNP, err_msg=emsg,  verbose=True)

In [11]:
emsg = "ERROR docker and 1000 sample names are different"
np.testing.assert_array_equal(dockerSampleNamesNP, parts1000SampleNamesNP, err_msg=emsg,  verbose=True)

# check fractions
Compare numpy test performance to pandas.

In [12]:
cols = dockerDF_columns[1:-3].values.tolist()
dockerFractionsDF = dockerDF.loc[: , cols]
#dockerFractionsDF.head()

In [13]:
cols = parts500DF_columns[1:-3].values.tolist()
parts500FractionsDF = parts500DF.loc[: , cols]

In [14]:
cols = partsDF1000_columns[1:-3].values.tolist()
parts1000FractionsDF = parts1000DF.loc[: , cols]

In [15]:
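# NOTE: `%time` on a line by itself times an empty statement, so the timing below does not measure the assert.
# Use `%%time` as the first line of the cell to time the whole cell.
%time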
emsg = "ERROR dockerFractionsDF and parts500FractionsDF differ"
np.testing.assert_array_equal(dockerFractionsDF.values, parts500FractionsDF.values, err_msg=emsg, verbose=True)

CPU times: user 5 µs, sys: 12 µs, total: 17 µs
Wall time: 31 µs


In [16]:
emsg = "ERROR dockerFractionsDF and parts1000FractionsDF differ"
np.testing.assert_array_equal(dockerFractionsDF.values, parts1000FractionsDF.values, err_msg=emsg, verbose=True)
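To actually compare numpy and pandas test performance, the two assertions can be timed with `time.perf_counter`, which also works outside IPython. This is a rough sketch on synthetic data: the random frame only mimics the shape of the fractions block (15,801 rows by 83 columns, that is, 87 columns minus `Mixture` and the three statistics columns), and `best_seconds` is a hypothetical helper.

```python
import time

import numpy as np
import pandas as pd

def best_seconds(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls of fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)

# synthetic stand-in for the fractions data; values are random, not CIBERSORTx output
rng = np.random.default_rng(0)
leftDF = pd.DataFrame(rng.random((15801, 83)))
rightDF = leftDF.copy()

numpy_seconds = best_seconds(
    lambda: np.testing.assert_array_equal(leftDF.values, rightDF.values)
)
pandas_seconds = best_seconds(
    lambda: pd.testing.assert_frame_equal(leftDF, rightDF)
)
print(f"numpy : {numpy_seconds:.4f} s")
print(f"pandas: {pandas_seconds:.4f} s")
```

The two checks are not equivalent: `assert_frame_equal` also compares the index, column labels, and dtypes, so any timing difference partly reflects that extra work.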

# check stats
p-values are calculated using Monte Carlo simulation. Good chance they will differ.

In [17]:
cols = dockerDF_columns[-2:].values.tolist()
dockerStatsDF = dockerDF.loc[: , ['Correlation', 'RMSE']]
dockerStatsDF.head()

|   | Correlation | RMSE |
|---|-------------|------|
| 0 | 0.985425 | 0.926455 |
| 1 | 0.979791 | 0.934695 |
| 2 | 0.984906 | 0.447916 |
| 3 | 0.988464 | 0.907675 |
| 4 | 0.917052 | 0.949553 |


In [18]:
cols = parts500DF_columns[-2:].values.tolist()
parts500StatsDF = parts500DF.loc[: , ['Correlation', 'RMSE']]

In [19]:
cols = partsDF1000_columns[-2:].values.tolist()
parts1000StatsDF = parts1000DF.loc[: , ['Correlation', 'RMSE']]

In [20]:
pd.testing.assert_frame_equal(dockerStatsDF, parts500StatsDF)

In [21]:
pd.testing.assert_frame_equal(dockerStatsDF, parts1000StatsDF)

# P-values are off
CIBERSORTx does not provide a way to set the random seed, so the Monte Carlo simulation results may vary.
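A toy example shows why an unseeded Monte Carlo p-value is expected to vary between runs. This is a sketch of a generic empirical p-value, not CIBERSORTx's actual procedure: the standard-normal null distribution, the observed value, and both function names are assumptions made for illustration. The only link to the workflow is the permutation count of 100, taken from the `perm` input.

```python
import numpy as np

def empirical_p_value(observed, null_stats):
    """Fraction of null statistics at least as large as the observed statistic."""
    null_stats = np.asarray(null_stats)
    return float(np.mean(null_stats >= observed))

def monte_carlo_p_value(observed, num_perm, seed):
    """Toy Monte Carlo p-value: draw num_perm null statistics from a standard normal."""
    rng = np.random.default_rng(seed)
    null_stats = rng.standard_normal(num_perm)  # assumed null, for illustration only
    return empirical_p_value(observed, null_stats)

observed = 1.5  # arbitrary observed statistic
# different seeds mimic independent runs where the seed cannot be fixed
p_values = [monte_carlo_p_value(observed, num_perm=100, seed=s) for s in range(5)]
print(p_values)
```

With 100 draws, each estimate in this toy setup is a multiple of 0.01, so two runs can only agree exactly or differ by at least 0.01. A fixed seed reproduces the same value, which is the control that is missing here.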

In [22]:
#np.testing.assert_array_equal(dockerDF.loc[:, "P-value"].values, parts500DF.loc[: ,"P-value"].values )
np.testing.assert_allclose(dockerDF.loc[:, "P-value"].values, parts500DF.loc[: ,"P-value"].values, atol=0.02 )

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0.02

Mismatched elements: 265 / 15801 (1.68%)
Max absolute difference: 0.11
Max relative difference: 3.
 x: array([0.  , 0.  , 0.  , ..., 0.01, 0.  , 0.  ])
 y: array([0.  , 0.  , 0.  , ..., 0.01, 0.  , 0.  ])

In [23]:
np.testing.assert_allclose(dockerDF.loc[:, "P-value"].values, parts1000DF.loc[: ,"P-value"].values,  atol=0.02)

AssertionError: 
Not equal to tolerance rtol=1e-07, atol=0.02

Mismatched elements: 51 / 15801 (0.323%)
Max absolute difference: 0.06
Max relative difference: 3.
 x: array([0.  , 0.  , 0.  , ..., 0.01, 0.  , 0.  ])
 y: array([0.  , 0.  , 0.  , ..., 0.02, 0.01, 0.  ])
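Because `assert_allclose` stops at the first failure report, a small summary function can be handy for quantifying how far apart two p-value columns are. The sketch below is a hypothetical helper, not part of the workflow. It uses the same tolerance rule as `assert_allclose` and also counts how many samples would change significance at a threshold `alpha`, where the 0.05 default is an assumption.

```python
import numpy as np

def summarize_p_value_differences(left, right, atol=0.02, rtol=1e-7, alpha=0.05):
    """Summarize disagreement between two arrays of p-values.

    Hypothetical helper; atol and rtol mirror the np.testing.assert_allclose call.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    abs_diff = np.abs(left - right)
    # np.isclose uses |left - right| <= atol + rtol * |right|, as assert_allclose does
    beyond_tolerance = ~np.isclose(left, right, rtol=rtol, atol=atol)
    # a "flip" is a sample that is below alpha in one run but not in the other
    flips = (left < alpha) != (right < alpha)
    return {
        "num_samples": int(left.size),
        "num_beyond_tolerance": int(beyond_tolerance.sum()),
        "max_abs_diff": float(abs_diff.max()),
        "num_significance_flips": int(flips.sum()),
    }

# made-up p-values, for illustration only
left = [0.00, 0.01, 0.04, 0.10]
right = [0.00, 0.02, 0.10, 0.10]
print(summarize_p_value_differences(left, right))
```

In the notebook this could be called with `dockerDF.loc[:, "P-value"].values` and the corresponding column from `parts500DF` or `parts1000DF`.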